Abstract

We study clustering algorithms based on neighborhood graphs on a random sample 
of data points. The question we ask is how such a graph should be constructed in
order to obtain optimal clustering results. Which type of neighborhood graph should
one choose, mutual k-nearest neighbor or symmetric k-nearest neighbor? What is
the optimal parameter k? In our setting, clusters are defined as connected components
of the t-level set of the underlying probability distribution. Clusters are said
to be identified in the neighborhood graph if connected components in the graph 
correspond to the true underlying clusters. Using techniques from random geometric 
graph theory, we prove bounds on the probability that clusters are identified suc- 
cessfully, both in a noise-free and in a noisy setting. Those bounds lead to several 
conclusions. First, k has to be chosen surprisingly high (rather of the order n than of 
the order log n) to maximize the probability of cluster identification. Secondly, the 
major difference between the mutual and the symmetric k-nearest neighbor graph
occurs when one attempts to detect the most significant cluster only. 

Key words: clustering, neighborhood graph, random geometric graph, connected 
component 



1 Introduction 

Using graphs to model real world problems is one of the most widely used 
techniques in computer science. This approach usually involves two major 
steps: constructing an appropriate graph which represents the problem in a
convenient way, and then constructing an algorithm which solves the problem 
on the given type of graph. While in some cases there exists an obvious natural 
graph structure to model the problem, in other cases one has much more choice 
when constructing the graph. In the latter cases it is an important question 
how the actual construction of the graph influences the overall result of the 
graph algorithm. 

The kind of graphs we want to study in the current paper are neighborhood 
graphs. The vertices of those graphs represent certain "objects", and ver- 
tices are connected if the corresponding objects are "close" or "similar". The 
best-known families of neighborhood graphs are $\varepsilon$-neighborhood graphs and
k-nearest neighbor graphs. Given a number of objects and their mutual distances
to each other, in the first case each object will be connected to all
other objects which have distance smaller than $\varepsilon$, whereas in the second case,
each object will be connected to its k nearest neighbors (see below for exact
definitions). Neighborhood graphs are used for modeling purposes in many areas
of computer science: sensor networks and wireless ad-hoc networks, machine 
learning, data mining, percolation theory, clustering, computational geometry, 
modeling the spread of diseases, modeling connections in the brain, etc. 

In all those applications one has some freedom in constructing the neigh- 
borhood graph, and a fundamental question arises: how exactly should we 
construct the neighborhood graph in order to obtain the best overall result in 
the end? Which type of neighborhood graph should we choose? How should 
we choose its connectivity parameter, for example the parameter k in the 
k-nearest neighbor graph? It is obvious that those choices will influence the
results we obtain on the neighborhood graph, but often it is completely unclear 
how. 

In this paper, we want to focus on the problem of clustering. We assume that 
we are given a finite set of data points and pairwise distances or similarities 
between them. It is very common to model the data points and their distances 
by a neighborhood graph. Then clustering can be reduced to standard graph 
algorithms. In the easiest case, one can simply define clusters as connected 
components of the graph. Alternatively, one can try to construct minimal 
graph cuts which separate the clusters from each other. An assumption often 
made in clustering is that the given data points are a finite sample from 
some larger underlying space. For example, when a company wants to cluster 
customers based on their shopping profiles, it is clear that the customers in 
the company's data base are just a sample of a much larger set of possible 
customers. The customers in the data base are then considered to be a random 
sample. 

In this article, we want to make a first step towards such results in a simple 
setting we call "cluster identification" (see next section for details). Clusters
will be represented by connected components of the level set of the underlying
probability density. Given a finite sample from this density, we want to
construct a neighborhood graph such that we maximize the probability of cluster
identification. To this end, we study different kinds of k-nearest neighbor
graphs (mutual, symmetric) with different choices of k and prove bounds on 
the probability that the correct clusters can be identified in this graph. One of 
the first results on the consistency of a clustering method has been derived by 
Hartigan [8], who proved "fractional consistency" for single linkage clustering. 

The question we want to tackle in this paper is how to choose the neighbor- 
hood graph in order to obtain optimal clustering results. The mathematical 
model for building neighborhood graphs on randomly sampled points is a 
geometric random graph, see Penrose [12] for an overview. Such graphs are 
built by drawing a set of sample points from a probability measure on $\mathbb{R}^d$, and
then connecting neighboring points (see below for exact definitions). Note that
the random geometric graph model is different from the classical Erdős-Rényi
random graph model (cf. Bollobás [3] for an overview) where vertices do not
have a geometric meaning, and edges are chosen independently of the vertices 
and independently of each other. In the setup outlined above, the choice of 
parameter is closely related to the question of connectivity of random geomet- 
ric graphs, which has been extensively studied in the random geometric graph 
community. Connectivity results are not only important for clustering, but also 
in many other fields of computer science such as modeling ad-hoc networks 
(e.g., Santi and Blough [13], Bettstetter [1], Kunniyur and Venkatesh [10]) or
percolation theory (Bollobás and Riordan [3]). The existing random geometric
graph literature mainly focuses on asymptotic statements about connectivity, 
that is, results in the limit for infinitely many data points. Moreover, it is
usually assumed that the underlying density is uniform, the exact opposite of
the setting we consider in clustering. What we would need in our context are
non-asymptotic results on the performance of different kinds of graphs on a
finite point set which has been drawn from highly clustered densities. 

Our results on the choice of graph type and the parameter k for cluster iden- 
tification can be summarized as follows. Concerning the question of the choice 
of k, we obtain the result that k should be chosen surprisingly high,
namely of the order O(n) instead of O(log n) (the latter would be the rate
one would "guess" from results in standard random geometric graphs). Con- 
cerning the types of graph, it turns out that different graphs have advantages 
in different situations: if one is only interested in identifying the "most sig- 
nificant" cluster (while some clusters might still not be correctly identified), 
then the mutual kNN graph should be chosen. If one wants to identify many 
clusters simultaneously the bounds show no substantial difference between the 
mutual and the symmetric kNN graph. 
2 Main constructions and results



In this section we give a brief overview of the setup and techniques we use
in the following. Mathematically exact statements follow in the next sections. 

Neighborhood graphs. We always assume that we are given n data points
$X_1, \ldots, X_n$ which have been drawn i.i.d. from some probability measure which
has a density with respect to the Lebesgue measure in $\mathbb{R}^d$. As distance function
between points we use the Euclidean distance, which is denoted by dist. The
distance is extended to sets $A, B \subseteq \mathbb{R}^d$ via
$\mathrm{dist}(A, B) = \inf\{\mathrm{dist}(x, y) \mid x \in A, y \in B\}$. The data points are
used as vertices in an unweighted and undirected graph. By $\mathrm{kNN}(X_j)$ we denote
the set of the k nearest neighbors of $X_j$ among
$X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n$. The different neighborhood graphs are
defined as follows:

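• $\varepsilon$-neighborhood graph $G_{\mathrm{eps}}(n, \varepsilon)$:

$X_i$ and $X_j$ connected if $\mathrm{dist}(X_i, X_j) < \varepsilon$,

• symmetric k-nearest-neighbor graph $G_{\mathrm{sym}}(n, k)$:

$X_i$ and $X_j$ connected if $X_i \in \mathrm{kNN}(X_j)$ or $X_j \in \mathrm{kNN}(X_i)$,

• mutual k-nearest-neighbor graph $G_{\mathrm{mut}}(n, k)$:

$X_i$ and $X_j$ connected if $X_i \in \mathrm{kNN}(X_j)$ and $X_j \in \mathrm{kNN}(X_i)$.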

Note that the literature does not agree on the names for the different kNN 
graphs. In particular, the graph we call "symmetric" usually does not have a 
special name. 
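
The three graph definitions above can be made concrete with a small sketch. The function names (`knn_sets`, `eps_graph`, `symmetric_knn_graph`, `mutual_knn_graph`) are hypothetical and chosen only for illustration; the quadratic-time construction is an assumption made for brevity, not a recommendation for large samples.

```python
import math

def knn_sets(points, k):
    """For each index j, return the index set of the k nearest neighbors of points[j]."""
    n = len(points)
    result = []
    for j in range(n):
        # Sort all other indices by Euclidean distance to points[j].
        # Ties are broken by index; under a density they occur with probability zero.
        order = sorted((i for i in range(n) if i != j),
                       key=lambda i: math.dist(points[i], points[j]))
        result.append(set(order[:k]))
    return result

def eps_graph(points, eps):
    """Edge (i, j) with i < j whenever dist(X_i, X_j) < eps."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) < eps}

def symmetric_knn_graph(points, k):
    """Edge (i, j) if X_i is among the kNN of X_j OR X_j is among the kNN of X_i."""
    nn = knn_sets(points, k)
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if i in nn[j] or j in nn[i]}

def mutual_knn_graph(points, k):
    """Edge (i, j) if X_i is among the kNN of X_j AND X_j is among the kNN of X_i."""
    nn = knn_sets(points, k)
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if i in nn[j] and j in nn[i]}
```

For the three collinear points (0, 0), (1, 0), (3, 0) and k = 1, the mutual graph contains only the edge (0, 1), whereas the symmetric graph additionally contains the edge (1, 2): the third point selects the second one as its nearest neighbor, but not vice versa. By construction the mutual graph is always a subgraph of the symmetric one.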

Most questions we will study in the following are much easier to solve for
$\varepsilon$-neighborhood graphs than for kNN graphs. The reason is that whether two
points $X_i$ and $X_j$ are connected in the $\varepsilon$-graph only depends on $\mathrm{dist}(X_i, X_j)$,
while in the kNN graph the existence of an edge between $X_i$ and $X_j$ also
depends on the distances of $X_i$ and $X_j$ to all other data points. However, the
kNN graph is the one most commonly used in practice. Hence we decided to
focus on kNN graphs. Most of the proofs can easily be adapted for the $\varepsilon$-graph.

The cluster model. There exists an overwhelming number of different definitions
of what clustering is, and the clustering community is far from converging
on one point of view. In a sample-based setting most definitions agree
on the fact that clusters should represent high density regions of the data
space which are separated by low density regions. Then a straightforward
way to define clusters is to use level sets of the density. Given the underlying
density p of the data space and a parameter t > 0, we define the t-level set
L(t) as the closure of the set of all points $x \in \mathbb{R}^d$ with p(x) > t. Clusters
are then defined as the connected components of the t-level set (where the 
term "connected component" is used in its topological sense and not in its 
graph-theoretic sense). 
Note that a different popular model is to define a clustering as a partition
of the whole underlying space such that the boundaries of the partition lie 
in a low density area. In comparison, looking for connected components of 
t-level sets is a stronger requirement. Even when we are given a complete 
partition of the underlying space, we do not yet know which part of each of 
the clusters is just "background noise" and which one really corresponds to
"interesting data". This problem is circumvented by the t-level set definition,
which not only distinguishes between the different clusters but also separates 
"foreground" from "background noise". Moreover, the level set approach is 
much less sensitive to outliers, which often heavily influence the results of 
partitioning approaches. 

The cluster identification problem. Given a finite sample from the un- 
derlying distribution, our goal is to identify the sets of points which come from 
different connected components of the t-level set. We study this problem in 
two different settings: 

The noise-free case. Here we assume that the support of the density con- 
sists of several connected components which have a positive distance to each 
other. Between those components, there is only "empty space" (density 0). 
Each of the connected components is called a cluster. Given a finite sample 
$X_1, \ldots, X_n$ from such a density, we construct a neighborhood graph G based
on this sample. We say that a cluster is identified in the graph if the
connected components in the neighborhood graph correspond to the connected
components of the underlying density, that is, all points
originating in the same underlying cluster are connected in the graph, and
they are not connected to points from any other cluster.
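
In a simulation where the true cluster of every sample point is known, this definition can be checked directly. The following is a minimal sketch under that assumption; `labels` (the true cluster index of each sample point) and the function names are hypothetical and only serve the illustration.

```python
def connected_components(n, edges):
    """Connected components of an undirected graph on vertices 0..n-1 (iterative DFS)."""
    adjacency = [set() for _ in range(n)]
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                stack.append(w)
        components.append(component)
    return components

def cluster_identified(n, edges, labels, cluster):
    """True if the sample points of `cluster` form exactly one connected component.

    This encodes both requirements: all points of the cluster are connected,
    and none of them is connected to a point of another cluster.
    """
    members = {v for v in range(n) if labels[v] == cluster}
    return members in connected_components(n, edges)
```

With `labels = [0, 0, 1, 1]`, the edge set `{(0, 1), (2, 3)}` identifies both clusters, while adding the bridging edge `(1, 2)` destroys the identification of both.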

The noisy case. Here we no longer assume that the clusters are separated 
by "empty space", but we allow the underlying density to be supported ev- 
erywhere. Clusters are defined as the connected components of the t-level set 
L(t) of the density (for a fixed parameter t chosen by the user), and points not 
contained in this level set are considered as background noise. A point $x \in \mathbb{R}^d$
is called a cluster point if $x \in L(t)$ and a background point otherwise. As in the
previous case we will construct a neighborhood graph G on the given sample.
However, we will remove points from this graph which we consider as noise.
The remaining graph $\tilde G$ will be a subgraph of the graph G, containing fewer
vertices and fewer edges than G. As opposed to the noise-free case, we now
define two slightly different cluster identification problems. They differ in the 
way background points are treated. The reason for this more involved con- 
struction is that in the noisy case, one cannot guarantee that no additional 
background points from the neighborhood of the cluster will belong to the 
graph. 
We say that a cluster is roughly identified in the remaining graph $\tilde G$ if
the following properties hold: 

• all sample points from a cluster are contained as vertices in the graph, that 
is, only background points are dropped, 

• the vertices belonging to the same cluster are connected in the graph, that 
is, there exists a path between each two of them, and 

• every connected component of the graph contains only points of exactly one 
cluster (and maybe some additional noise points, but no points of a different 
cluster). 

We say that a cluster is exactly identified in $\tilde G$ if

• it is roughly identified, and 

• the ratio of the number of background points and the number of cluster 
points in the graph $\tilde G$ converges almost surely to zero as the sample size
approaches infinity. 

If all clusters have been roughly identified, the number of connected components
of the graph $\tilde G$ is equal to the number of connected components of the
level set L(t). However, the graph $\tilde G$ might still contain a significant number of
background points. In this sense, exact cluster identification is a much stronger 
problem, as we require that the fraction of background points in the graph 
has to approach zero. Exact cluster identification is an asymptotic statement, 
whereas rough cluster identification can be verified on each finite sample. Fi- 
nally, note that in the noise-free case, rough and exact cluster identification 
coincide. 
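
Since rough identification can be verified on each finite sample, the three defining properties can be tested directly when the true labels are available, for example in a simulation. The sketch below makes that assumption; `labels` maps every sample index to its true cluster or to `None` for background points, and all names are hypothetical.

```python
def roughly_identified(vertices, edges, labels, cluster):
    """Check the three rough-identification properties for one cluster.

    vertices: sample indices kept in the reduced graph
    edges:    undirected edges (i, j) of the reduced graph
    labels:   dict mapping every sample index to its true cluster,
              or to None for background points
    """
    kept = set(vertices)
    members = {v for v, lab in labels.items() if lab == cluster}
    # Property 1: no sample point of the cluster has been dropped.
    if not members or not members <= kept:
        return False
    adjacency = {v: set() for v in kept}
    for i, j in edges:
        if i in kept and j in kept:
            adjacency[i].add(j)
            adjacency[j].add(i)
    # Depth-first search from an arbitrary cluster point.
    start = next(iter(members))
    component, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adjacency[v] - component:
            component.add(w)
            stack.append(w)
    # Property 2: all points of the cluster lie in one connected component.
    if not members <= component:
        return False
    # Property 3: that component contains no point of a different cluster
    # (background points are allowed).
    return all(labels[v] in (cluster, None) for v in component)
```

Exact identification additionally requires the fraction of background points to vanish as n grows, which is an asymptotic statement and therefore cannot be checked on a single sample in this way.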

The clustering algorithms. To determine the clusters in the finite sample, 
we proceed as follows. First, we construct a neighborhood graph on the sample. 
This graph looks different, depending on whether we allow noise or not: 

Noise-free case. Given the data, we simply construct the mutual or symmetric 
k-nearest neighbor graph ($G_{\mathrm{mut}}(n, k)$ resp. $G_{\mathrm{sym}}(n, k)$) on the data points, for
a certain parameter k, based on the Euclidean distance. Clusters are then the 
connected components of this graph. 

Noisy case. Here we use a more complex procedure: 

• As in the noise-free case, construct the mutual (symmetric) kNN graph
$G_{\mathrm{mut}}(n, k)$ (resp. $G_{\mathrm{sym}}(n, k)$) on the samples.

• Estimate the density $p_n(X_i)$ at every sample point $X_i$ (e.g., by kernel density
estimation).

• If $p_n(X_i) < t'$, remove the point $X_i$ and its adjacent edges from the graph
(where t' is a parameter determined later). The resulting graph is denoted
by $G'_{\mathrm{mut}}(n, k, t')$ (resp. $G'_{\mathrm{sym}}(n, k, t')$).

• Determine the connected components of $G'_{\mathrm{mut}}(n, k, t')$ (resp. $G'_{\mathrm{sym}}(n, k, t')$),
for example by a simple depth-first search.

• Remove the connected components of the graph that are "too small", that is,
which contain fewer than $\delta n$ points (where $\delta$ is a small parameter determined
later).

• The resulting graph is denoted by $\tilde G_{\mathrm{mut}}(n, k, t', \delta)$ (resp. $\tilde G_{\mathrm{sym}}(n, k, t', \delta)$);
its connected components are the clusters of the sample.

Note that by removing the small components in the graph the method becomes 
very robust against outliers and "fake" clusters (small connected components 
just arising by random fluctuations). 
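
The noisy-case procedure above can be sketched end to end as follows. This is only an illustration under stated assumptions: the mutual kNN graph is used, the density estimator is a Gaussian kernel estimator with bandwidth `h` (the text only asks for some kernel density estimate), and all function names are hypothetical.

```python
import math

def gaussian_kde(points, h):
    """Return x -> Gaussian kernel density estimate with bandwidth h (an assumed kernel)."""
    n, d = len(points), len(points[0])
    norm = n * (h * math.sqrt(2.0 * math.pi)) ** d
    def estimate(x):
        return sum(math.exp(-math.dist(x, p) ** 2 / (2.0 * h * h)) for p in points) / norm
    return estimate

def mutual_knn_edges(points, k):
    """Edges (i, j), i < j, of the mutual kNN graph."""
    n = len(points)
    nn = []
    for j in range(n):
        order = sorted((i for i in range(n) if i != j),
                       key=lambda i: math.dist(points[i], points[j]))
        nn.append(set(order[:k]))
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if i in nn[j] and j in nn[i]}

def components(vertices, edges):
    """Connected components of the graph restricted to `vertices` (depth-first search)."""
    adjacency = {v: set() for v in vertices}
    for i, j in edges:
        if i in adjacency and j in adjacency:  # drop edges of removed points
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                stack.append(w)
        result.append(comp)
    return result

def noisy_clusters(points, k, t_prime, delta, h):
    """Clusters of the sample: components of the reduced mutual kNN graph."""
    n = len(points)
    edges = mutual_knn_edges(points, k)        # build the kNN graph
    density = gaussian_kde(points, h)          # estimate the density
    kept = [i for i in range(n)                # remove points with estimate < t'
            if density(points[i]) >= t_prime]
    comps = components(kept, edges)            # connected components
    return [c for c in comps if len(c) >= delta * n]  # drop components with < delta*n points
```

A call such as `noisy_clusters(points, k=2, t_prime=0.4, delta=0.3, h=0.2)` returns a list of index sets, one per detected cluster. The choice of `t_prime`, `delta` and `h` here is arbitrary; the admissible ranges are the subject of the results below.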

Main results, intuitively. We would like to outline our results briefly in 
an intuitive way. Exact statements can be found in the following sections. 

Result 1 (Range of k for successful cluster identification) Under mild 
assumptions, and for n large enough, there exist constants $c_1, c_2 > 0$ such that
for any $k \in [c_1 \log n, c_2 n]$, all clusters are identified with high probability in
both the mutual and symmetric kNN graph. This result holds for cluster iden- 
tification in the noise-free case as well as for the rough and the exact cluster 
identification problem (the latter seen as an asymptotic statement) in the noisy 
case (with different constants $c_1, c_2$).

For the noise-free case, the lower bound on k has already been proven in Brito 
et al. [5], for the noisy case it is new. Importantly, in the exact statement of 
the result all constants have been worked out more carefully than in Brito 
et al. [5], which is very important for proving the following statements.

Result 2 (Optimal k for cluster identification) Under mild assumptions, 
and for n large enough, the parameter k which maximizes the probability of 
successful identification of one cluster in the noise-free case has the form
$k = c_1 n + c_2$, where $c_1, c_2$ are constants which depend on the geometry of the cluster.
This result holds for both the mutual and the symmetric kNN graph, but the 
convergence rates are different (see Result 3). A similar result holds as well 
for rough cluster identification in the noisy case, with different constants. 

This result is completely new, both in the noise-free and in the noisy case. In 
the light of the existing literature, it is rather surprising. So far it has been well 
known that in many different settings the lower bound for obtaining connected 
components in a random kNN graph is of the order k ~ log n. However, we now 
can see that maximizing the probability of obtaining connected components on 
a finite sample leads to a dramatic change: k has to be chosen much higher 
than log n, namely of the order n itself. Moreover, we were surprised ourselves 
that this result does not only hold in the noise-free case, but can also be carried 
over to rough cluster identification in the noisy setting. 
For exact cluster identification we did not manage to determine an optimal
choice of k due to the very difficult setting. For large values of k, small com- 
ponents which can be discarded will no longer exist. This implies that a lot 
of background points are attached to the real clusters. On the other hand, for 
small values of k there will exist several small components around the cluster 
which are discarded, so that there are fewer background points attached to the
final cluster. However, this tradeoff is very hard to grasp in technical terms. 
We therefore leave the determination of an optimal value of k for exact cluster 
identification as an open problem. Moreover, as exact cluster identification 
concerns the asymptotic case of $n \to \infty$ only, and rough cluster identification
is all one can achieve on a finite sample anyway, we are perfectly happy to be 
able to prove the optimal rate in that case. 

Result 3 (Identification of the most significant cluster) For the opti- 
mal k as stated in Result 2, the convergence rate (with respect to n) for the 
identification of one fixed cluster is different for the mutual and the sym- 
metric kNN graph. It depends 

• only on the properties of the cluster itself in the mutual kNN graph 

• on the properties of the "least significant", that is, the "worst" out of all
clusters in the symmetric kNN graph. 

This result shows that if one is interested in identifying the "most significant" 
clusters only, one is better off using the mutual kNN graph. When the goal is to 
identify all clusters, then there is not much difference between the two graphs, 
because both of them have to deal with the "worst" cluster anyway. Note 
that this result is mainly due to the different between-cluster connectivity
properties of the graphs; the within-cluster connectivity results are not so
different (using our proof techniques at least).

Proof techniques, intuitively. Given a neighborhood graph on the sample, 
cluster identification always consists of two main steps: ensuring that points 
of the same cluster are connected and that points of different clusters are not 
connected to each other. We call those two events "within-cluster connected- 
ness" and "between-cluster disconnectedness" (or "cluster isolation"). 

To treat within-cluster connectedness we work with a covering of the true 
cluster. We cover the whole cluster by balls of a certain radius z. Then we 
want to ensure that, first, each of the balls contains at least one of the sample 
points, and second, that points in neighboring balls are always connected in 
the kNN graph. Those are two contradicting goals. The larger z is, the easier 
it is to ensure that each ball contains a sample point. The smaller z is, the 
easier it is to ensure that points in neighboring balls will be connected in the 
graph for a fixed number of neighbors k. So the first part of the proof consists 
in computing the probability that for a given z both events occur at the same 
time and finding the optimal z. 
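
The first of the two competing requirements can be illustrated numerically with a plain union bound. The sketch below is not the bound proved in this paper: it assumes that the cluster is covered by `num_balls` balls of radius z which lie entirely inside the cluster, and that the density is at least t there, so each ball has probability mass at least $t \eta_d z^d$. Both function names are hypothetical.

```python
import math

def unit_ball_volume(d):
    """Volume of the d-dimensional Euclidean unit ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def empty_ball_bound(n, num_balls, t, z, d):
    """Union bound on the probability that some covering ball contains no sample point.

    Assumes each ball has mass at least t * eta_d * z**d, so a fixed ball is
    missed by all n i.i.d. sample points with probability at most (1 - mass)**n.
    """
    mass = min(1.0, t * unit_ball_volume(d) * z ** d)
    return min(1.0, num_balls * (1.0 - mass) ** n)
```

For fixed `num_balls`, the bound shrinks rapidly as z grows, which is the "larger z is easier" half of the tradeoff. The opposite half, that points in neighboring balls must be among each other's k nearest neighbors, pushes z down for fixed k.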
Between-cluster connectivity is easier to treat. Given a lower bound on the
distance u between two clusters, all we have to do is to make sure that edges in 
the kNN graph never become longer than u, that is we have to prove bounds 
on the maximal kNN distance in the sample. 
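
On a given sample this condition is easy to evaluate: every edge of a kNN graph is at most as long as the largest kNN radius, so it suffices to compare that radius with u. The following sketch uses hypothetical function names and treats u as a known lower bound on the distance between clusters.

```python
import math

def knn_radius(points, j, k):
    """Distance from points[j] to its k-th nearest neighbor among the other points."""
    distances = sorted(math.dist(points[j], p)
                       for i, p in enumerate(points) if i != j)
    return distances[k - 1]

def max_knn_radius(points, k):
    """Largest kNN radius over the whole sample."""
    return max(knn_radius(points, j, k) for j in range(len(points)))

def clusters_isolated(points, k, u):
    """Sufficient check: no kNN edge can span a gap of width at least u."""
    return max_knn_radius(points, k) < u
```

For two groups of three points on the real line at 0, 1, 2 and 10, 11, 12, separated by a gap of u = 8, the check succeeds for k = 2 but fails for k = 3, because each point then needs a neighbor from the other group.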

In general, those techniques can be applied with small modifications both in 
the noise-free and in the noisy case, provided we construct our graphs in the
way described above. The complication in the noisy case is that if we just 
used the standard kNN graph as in the noise-free case, then of course the 
whole space would be considered as one connected component, and this would 
also show up in the neighborhood graphs. Thus, one has to artificially reduce 
the neighborhood graph in order to remove the background component. Only 
then one can hope to obtain a graph with different connected components 
corresponding to different clusters. The way we construct the graph $\tilde G$ ensures
this. First, under the assumption that the error of the density estimator is
bounded by $\varepsilon$, we consider the $(t - \varepsilon)$-level set instead of the t-level set we
are interested in. This ensures that we do not remove "true cluster points" in
our procedure. A second, large complication in the noisy case is that with a
naive approach, the radius z of the covering and the accuracy $\varepsilon$ of the density
estimator would be coupled to each other. We would need to ensure that the
parameter $\varepsilon$ decreases with a certain rate depending on z. This would lead to
complications in the proof as well as very slow convergence rates. The trick
by which we can avoid this is to introduce the parameter $\delta$ and throw away
all connected components which are smaller than $\delta n$. Thus, we ensure that no
small connected components are left over in the boundary of the $(t - \varepsilon)$-level
set of a cluster, and all remaining points which are in this boundary strip will
be connected to the main cluster represented by the t-level set. Note that this
construction allows us to estimate the number of clusters even without exact 
estimation of the density. 

Building blocks from the literature. To a certain extent, our proofs 
follow and combine some of the techniques presented in Brito et al. [5] and 
Biau et al. [2]. 

In Brito et al. [5] the authors study the connectivity of random mutual k- 
nearest neighbor graphs. However, they are mainly interested in asymptotic 
results, only consider the noise-free case, and do not attempt to make statements
about the optimal choice of k. Their main result is that in the noise-free
case, choosing k at least of the order O(log n) ensures that in the limit
for $n \to \infty$, connected components of the mutual k-nearest neighbor graph
correspond to true underlying clusters. 

In Biau et al. [2], the authors study the noisy case and define clusters as 
connected components of the t-level set of the density. As in our case, the 
authors use density estimation to remove background points from the sample, 
but then work with an $\varepsilon$-neighborhood graph instead of a k-nearest neighbor
graph on the remaining sample. Connectivity of this kind of graph is much
easier to treat than that of k-nearest neighbor graphs, as the connectivity
of two points in the $\varepsilon$-graph does not depend on any other points in the
sample (this is not the case in the k-nearest neighbor graph). Then, Biau et al.
[2] prove asymptotic results for the estimation of the connected components
of the level set L(t), but also do not investigate the optimal choice of their
graph parameter $\varepsilon$. Moreover, due to our additional step where we remove
small components of the graph, we can provide much faster rates for the 
estimation of the components, since we have a much weaker coupling of the 
density estimator and the clustering algorithm. 

Finally, note that a considerably shorter version of the current paper dealing 
with the noise-free case only has appeared in Maier et al. [11]. In the current 
paper we have shortened the proofs significantly at the expense of having 
slightly worse constants in the noise-free case. 



3 General assumptions and notation 



Density and clusters. Let p be a bounded probability density with respect
to the Lebesgue measure on $\mathbb{R}^d$. The measure on $\mathbb{R}^d$ that is induced by the
density p is denoted by $\mu$. Given a fixed level parameter t > 0, the t-level set
of the density p is defined as

$$L(t) = \overline{\{x \in \mathbb{R}^d \mid p(x) > t\}},$$

where the bar denotes the topological closure (note that level sets are closed
by assumptions in the noisy case, but this is not necessarily the case in the
noise-free setting).

Geometry of the clusters. We define clusters as the connected components
of L(t) (where the term "connected component" is used in its topological
sense). The number of clusters is denoted by m, and the clusters themselves
by $C^{(1)}, \ldots, C^{(m)}$. We set $\beta_{(i)} := \mu(C^{(i)})$, that is, the probability mass in
cluster $C^{(i)}$.

We assume that each cluster $C^{(i)}$ ($i = 1, \ldots, m$) is a disjoint, compact and
connected subset of $\mathbb{R}^d$, whose boundary $\partial C^{(i)}$ is a smooth $(d-1)$-dimensional
submanifold in $\mathbb{R}^d$ with minimal curvature radius $\kappa^{(i)} > 0$ (the inverse of
the largest principal curvature of $\partial C^{(i)}$). For $\nu < \kappa^{(i)}$ we define the collar
set $\mathrm{Col}^{(i)}(\nu) = \{x \in C^{(i)} \mid \mathrm{dist}(x, \partial C^{(i)}) < \nu\}$ and the maximal covering
radius $\nu^{(i)}_{\max} = \max_{\nu < \kappa^{(i)}} \{\nu \mid C^{(i)} \setminus \mathrm{Col}^{(i)}(\nu) \text{ connected}\}$. These quantities will
be needed for the following reasons: It will be necessary to cover the inner
part of each cluster by balls of a certain fixed radius z, and those balls are
not supposed to "stick outside". Such a construction is only possible under 
assumptions on the maximal curvature of the boundary of the cluster. This 
will be particularly important in the noisy case, where all statements about 
the density estimator only hold in the inner part of the cluster. 

For an arbitrary $\varepsilon > 0$, the connected component of $L(t - \varepsilon)$ which contains the
cluster $C^{(i)}$ is denoted by $C_-^{(i)}(\varepsilon)$. Points in the set $C_-^{(i)}(\varepsilon) \setminus C^{(i)}$ will sometimes
be referred to as boundary points. To express distances between the clusters,
we assume that there exists some $\tilde\varepsilon > 0$ such that
$\mathrm{dist}(C_-^{(i)}(2\tilde\varepsilon), C_-^{(j)}(2\tilde\varepsilon)) > u^{(i)} > 0$ for all $i, j \in \{1, \ldots, m\}$ with $i \ne j$.
The numbers $u^{(i)}$ will represent lower bounds
on the distances between cluster $C^{(i)}$ and the remaining clusters. Note that
the existence of the $u^{(i)} > 0$ ensures that $C_-^{(i)}(2\varepsilon)$ does not contain any other
clusters apart from $C^{(i)}$ for $\varepsilon < \tilde\varepsilon$. Analogously to the definition of $\beta_{(i)}$ above
we set $\tilde\beta_{(i)} = \mu(C_-^{(i)}(2\tilde\varepsilon))$, that is, the mass of the enlarged set $C_-^{(i)}(2\tilde\varepsilon)$. These
definitions are illustrated in Figure 1. Furthermore, we introduce a lower bound
on the probability mass in balls of radius $u^{(i)}$ around points in $C_-^{(i)}(2\tilde\varepsilon)$,

$$\rho^{(i)} \le \inf_{x \in C_-^{(i)}(2\tilde\varepsilon)} \mu\bigl(B(x, u^{(i)})\bigr).$$

In particular, under our assumptions on the smoothness of the cluster boundary
we can set $\rho^{(i)} = O^{(i)}(u^{(i)})\, t\, \eta_d\, (u^{(i)})^d$ for an overlap constant

$$O^{(i)}(u^{(i)}) = \inf_{x \in C_-^{(i)}(2\tilde\varepsilon)} \frac{\mathrm{vol}\bigl(B(x, u^{(i)}) \cap C_-^{(i)}(2\tilde\varepsilon)\bigr)}{\mathrm{vol}\bigl(B(x, u^{(i)})\bigr)} > 0.$$

The way it is constructed, $\rho^{(i)}$ becomes larger the larger the distance of $C^{(i)}$
to all the other clusters is and is upper bounded by the probability mass of
the extended cluster $\tilde\beta_{(i)}$.

Fig. 1. An example of our cluster definition. The clusters $C^{(1)}, C^{(2)}$ are defined
as the connected components of the t-level set of the density (here t = 0.07). The
clusters are subsets of the sets $C_-^{(1)}(2\varepsilon)$, $C_-^{(2)}(2\varepsilon)$ (here for $\varepsilon = 0.01$).
Example in the noisy case. All assumptions on the density and the clusters
are satisfied if we assume that the density p is twice continuously differentiable
on a neighborhood of $\{p = t\}$, for each $x \in \{p = t\}$ the gradient of p at x is
non-zero, and $\mathrm{dist}(C^{(i)}, C^{(j)}) = u' > u^{(i)}$.

Example in the noise-free case. Here we assume that the support of the 
density p consists of m connected components $C^{(1)}, \ldots, C^{(m)}$ which satisfy the
smoothness assumptions above, and such that the densities on the connected 
components are lower bounded by a positive constant t. Then the noise-free 
case is a special case of the noisy case. 

Sampling. Our n sample points $X_1, \ldots, X_n$ will be sampled i.i.d. from the
underlying probability distribution. 

Density estimation in the noisy case. In the noisy case we will estimate 
the density at each data point $X_j$ by some estimate $p_n(X_j)$. For convenience,
we state some of our results using a standard kernel density estimator, see 
Devroye and Lugosi [6] for background reading. However, our results can easily 
be rewritten with any other density estimate. 

Further notation. The kNN radius of a point $X_j$ is the maximum distance
to a point in $\mathrm{kNN}(X_j)$. $R^{(i)}_{\min}$ denotes the minimal kNN radius of the sample
points in cluster $C^{(i)}$, whereas $R^{(i)}_{\max}$ denotes the maximal kNN radius of the
sample points in $C_-^{(i)}(2\tilde\varepsilon)$. Note here the difference in the point sets that are
considered.

Bin(n, p) denotes the binomial distribution with parameters n and p. Probabilistic
events will be denoted with curly capital letters $\mathcal{A}, \mathcal{B}, \ldots$, and their
complements with $\mathcal{A}^c, \mathcal{B}^c, \ldots$.



4 Exact statements of the main results 

In this section we are going to state all our main results in a formal way. In 
the statement of the theorems we need the following conditions. The first one 
is necessary for both the noise-free and the noisy case, whereas the second
one is needed for the noisy case only. 

• Condition 1: Lower and upper bounds on the number of neighbors k, 

k > 4 d+1 ^log(28 d p m Lvol(C«)n), 

k < (n - 1) min { ^ - ^W") , A nun {(u^f, 
Table 1
Table of notations

Symbol                                  Meaning
$p(x)$                                  density
$p_n(x)$                                density estimate in point x
$t$                                     density level set parameter
$L(t)$                                  t-level set of p
$C^{(1)}, \ldots, C^{(m)}$              clusters, i.e. connected components of L(t)
$C_-^{(i)}(\varepsilon)$                connected component of $L(t - \varepsilon)$ containing $C^{(i)}$
$\beta_{(i)}, \tilde\beta_{(i)}$        probability mass of $C^{(i)}$ and $C_-^{(i)}(2\tilde\varepsilon)$ respectively
$p^{(i)}_{\max}$                        maximal density in cluster $C^{(i)}$
$\rho^{(i)}$                            probability of balls of radius $u^{(i)}$ around points in $C_-^{(i)}(2\tilde\varepsilon)$
$\kappa^{(i)}$                          minimal curvature radius of the boundary $\partial C^{(i)}$
$\nu^{(i)}_{\max}$                      maximal covering radius of cluster $C^{(i)}$
$\mathrm{Col}^{(i)}(\nu)$               collar set for radius $\nu$
$u^{(i)}$                               lower bound on the distances between $C^{(i)}$ and other clusters
$\tilde\varepsilon$                     parameter such that $\mathrm{dist}(C_-^{(i)}(2\varepsilon), C_-^{(j)}(2\varepsilon)) > u^{(i)}$ for all $\varepsilon < \tilde\varepsilon$
$\eta_d$                                volume of the d-dimensional unit ball
$k$                                     number of neighbors in the construction of the graph



• Condition 2: The density p is three times continuously differentiable with 
uniformly bounded derivatives, /%) > 25, and e n sufficiently small such that 
/i(U(^ ) (2e n )\CW)) <5/2. 

Note that in Theorems 1 to 3 $\varepsilon_n$ is considered small but constant and thus we
drop the index $n$ there.

In our first theorem, we present the optimal choice of the parameter $k$ in the
mutual kNN graph for the identification of a cluster. This theorem treats both
the noise-free and the noisy case.

Theorem 1 (Optimal k for identification of one cluster in the mutual
kNN graph) The optimal choice of $k$ for identification of cluster $C^{(i)}$
in $G_{mut}(n, k)$ (noise-free case) resp. rough identification in $G_{mut}(n, k, t - \varepsilon, \delta)$
(noisy case) is

\[ k = (n-1)\,\Gamma^{(i)} + 1, \quad \text{with} \quad \Gamma^{(i)} := \frac{\rho^{(i)}}{2 + \frac{1}{4^d} \frac{t}{p^{(i)}_{\max}}}, \]

provided this choice of $k$ fulfills Condition 1.

In the noise-free case we obtain with

\[ \Omega^{(i)}_{noisefree} = \frac{\rho^{(i)}}{2\, 4^{d+1} \frac{p^{(i)}_{\max}}{t} + 4} \]

and for sufficiently large $n$

\[ P\left( \text{Cluster } C^{(i)} \text{ is identified in } G_{mut}(n, k) \right) \ge 1 - 3 e^{-(n-1)\, \Omega^{(i)}_{noisefree}}. \]

For the noisy case, assume that additionally Condition 2 holds and let $p_n$ be a
kernel density estimator with bandwidth $h$. Then there exist constants $C_1, C_2$
such that if $h^2 \le C_1 \varepsilon$ we get with

\[ \Omega^{(i)}_{noisy} = \min\left\{ \frac{\rho^{(i)}}{2\, 4^{d+1} \frac{p^{(i)}_{\max}}{t} + 4},\; \frac{n}{n-1} \frac{\delta}{8},\; \frac{n}{n-1} C_2 h^d \varepsilon^2 \right\} \]

and for sufficiently large $n$

\[ P\left( \text{Cluster } C^{(i)} \text{ roughly identified in } G_{mut}(n, k, t - \varepsilon, \delta) \right) \ge 1 - 8 e^{-(n-1)\, \Omega^{(i)}_{noisy}}. \]

This theorem has several remarkable features. First of all, we can see that
both in the noise-free and in the noisy case, the optimal choice of $k$ is roughly
linear in $n$. This is pretty surprising, given that the lower bound for cluster
connectivity in random geometric graphs is $k \sim \log n$. We will discuss the
important consequences of this result in the last section.
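To make the linear growth of the optimal $k$ concrete, the following Python sketch evaluates the noise-free quantities of Theorem 1 as read above. It is an illustration under stated assumptions, not part of the original analysis: the function name `optimal_k` and the numerical inputs are hypothetical, and the cluster parameters $\rho^{(i)}$, $p^{(i)}_{\max}$ and $t$ are assumed to be known, which is rarely the case in practice.

```python
import math

def optimal_k(n, rho, p_max, t, d):
    """Noise-free quantities of Theorem 1 for one cluster (all inputs assumed known)."""
    ratio = t / p_max                                  # t / p_max, lies in (0, 1]
    gamma = rho / (2.0 + ratio / 4 ** d)               # Gamma^{(i)}
    omega = rho / (2.0 * 4 ** (d + 1) / ratio + 4.0)   # Omega_noisefree^{(i)}
    k = (n - 1) * gamma + 1                            # optimal k, linear in n
    failure_bound = 3.0 * math.exp(-(n - 1) * omega)   # bound on P(cluster not identified)
    return k, gamma, omega, failure_bound

# Illustrative numbers only (not taken from the paper).
k, gamma, omega, bound = optimal_k(n=100001, rho=0.2, p_max=2.0, t=1.0, d=2)
print(round(k), bound)
```

Note that $\Omega^{(i)}_{noisefree} = \Gamma^{(i)}\, t / (4^{d+1} p^{(i)}_{\max})$, so the rate deteriorates quickly with the dimension $d$ even though the optimal $k$ stays a constant fraction of $n$.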

Secondly, we can see that for the mutual kNN graph the identification of one
cluster $C^{(i)}$ only depends on the properties of the cluster $C^{(i)}$ but not on
the ones of any other cluster. This is a unique feature of the mutual kNN
graph which comes from the fact that if cluster $C^{(i)}$ is very "dense", then
the neighborhood relationship of points in $C^{(i)}$ never links outside of cluster
$C^{(i)}$. In the mutual kNN graph this implies that any connections of $C^{(i)}$ to
other clusters are prevented. Note that this is not true for the symmetric
kNN graph, where another cluster can simply link into $C^{(i)}$ no matter which
internal properties $C^{(i)}$ has.
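The difference between the two graph types can be seen on a toy example. The sketch below is a hedged illustration in plain Python: it assumes the usual definitions (an edge in the mutual graph requires each point to be among the other's $k$ nearest neighbors, an edge in the symmetric graph requires only one direction), uses one-dimensional points chosen by hand, and all function names are hypothetical.

```python
def knn_sets(points, k):
    """For every index i, the set of indices of the k nearest other points (1-D data)."""
    n = len(points)
    neigh = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (abs(points[j] - points[i]), j))
        neigh.append(set(order[:k]))
    return neigh

def knn_graph(points, k, mutual):
    """Edge set of the mutual (both directions) or symmetric (either direction) kNN graph."""
    neigh = knn_sets(points, k)
    edges = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            i_to_j, j_to_i = j in neigh[i], i in neigh[j]
            if (i_to_j and j_to_i) if mutual else (i_to_j or j_to_i):
                edges.add((i, j))
    return edges

def components(n, edges):
    """Connected components as sorted index lists, found by depth-first search."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

# A dense group (indices 0-3) and two sparse points far away (indices 4, 5).
pts = [0.0, 0.1, 0.2, 0.3, 5.0, 8.0]
print(components(len(pts), knn_graph(pts, k=3, mutual=True)))
print(components(len(pts), knn_graph(pts, k=3, mutual=False)))
```

In this example the sparse points list dense points among their nearest neighbors, but not vice versa, so only the symmetric graph merges the two groups.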

For the mutual graph, it thus makes sense to define the most significant cluster
as the one with the largest rate coefficient $\Omega^{(i)}$, since this is the one which can be
identified with the fastest rate. In the noise-free case one observes that the
coefficient $\Omega^{(i)}$ of cluster $C^{(i)}$ is large given that

• $\rho^{(i)}$ is large, which effectively means a large distance $u^{(i)}$ of $C^{(i)}$ to the closest
other cluster,

• $p^{(i)}_{\max}/t$ is small, so that the density is rather uniform inside the cluster $C^{(i)}$.

Note that those properties are the most simple properties one would think of
when imagining an "easily detectable" cluster. For the noisy case, a similar
analysis still holds as long as one can choose the constants $\delta$, $h$ and $\varepsilon$ small
enough.

Formally, the result for identification of clusters in the symmetric kNN graph
looks very similar to the one above.

Theorem 2 (Optimal k for identification of one cluster in the symmetric
kNN graph) We use the same notation as in Theorem 1 and define
$\rho_{\min} = \min_{i=1,\dots,m} \rho^{(i)}$. Then all statements about the optimal rates for $k$ in
Theorem 1 can be carried over to the symmetric kNN graph, provided one
replaces $\rho^{(i)}$ with $\rho_{\min}$ in the definitions of $\Gamma^{(i)}$, $\Omega^{(i)}_{noisefree}$ and $\Omega^{(i)}_{noisy}$. If
Condition 1 holds and the condition $k \le (n-1)\rho_{\min}/2 - 2\log(n)$ replaces the
corresponding one in Condition 1, we have in the noise-free case for sufficiently
large $n$

\[ P\left( C^{(i)} \text{ is identified in } G_{sym}(n, k) \right) \ge 1 - (m+2)\, e^{-(n-1)\, \Omega^{(i)}_{noisefree}}. \]

If additionally Condition 2 holds we have in the noisy case for sufficiently
large $n$

\[ P\left( C^{(i)} \text{ roughly identified in } G_{sym}(n, k, t - \varepsilon, \delta) \right) \ge 1 - (m+7)\, e^{-(n-1)\, \Omega^{(i)}_{noisy}}. \]



Observe that the constant $\rho^{(i)}$ has now been replaced by the minimal $\rho^{(j)}$
among all clusters $C^{(j)}$. This means that the rate of convergence for the symmetric
kNN graph is governed by the constant $\rho^{(j)}$ of the "worst" cluster, that
is the one which is most difficult to identify. Intuitively, this worst cluster is
the one which has the smallest distance to its neighboring clusters. In contrast
to the results for the mutual kNN graph, the rate for identification of $C^{(i)}$ in
the symmetric graph is governed by the worst cluster instead of the cluster
$C^{(i)}$ itself. This is a big disadvantage if the goal is to only identify the "most
significant" clusters. For this purpose the mutual graph has a clear advantage.

On the other hand, as we will see in the next theorem, the difference in
behavior between the mutual and symmetric graph vanishes as soon as we
attempt to identify all clusters.

Theorem 3 (Optimal k for identification of all clusters in the mutual
kNN graph) We use the same notation as in Theorem 1 and define
$\rho_{\min} = \min_{i=1,\dots,m} \rho^{(i)}$, $p_{\max} = \max_{i=1,\dots,m} p^{(i)}_{\max}$. The optimal choice of $k$ for
the identification of all clusters in the mutual kNN graph $G_{mut}(n, k)$ (noise-free
case) resp. rough identification of all clusters in $G_{mut}(n, k, t - \varepsilon, \delta)$ (noisy
case) is given by

\[ k = (n-1)\,\Gamma_{all} + 1, \quad \text{with} \quad \Gamma_{all} = \frac{\rho_{\min}}{2 + \frac{1}{4^d} \frac{t}{p_{\max}}}, \]

provided this choice of $k$ fulfills Condition 1 for all clusters $C^{(i)}$. In the
noise-free case we get the rate

\[ \Omega_{noisefree} = \frac{\rho_{\min}}{2\, 4^{d+1} \frac{p_{\max}}{t} + 4}, \]

such that for sufficiently large $n$

\[ P\left( \text{All clusters exactly identified in } G_{mut}(n, k) \right) \ge 1 - 3m\, e^{-(n-1)\, \Omega_{noisefree}}. \]

For the noisy case, assume that additionally Condition 2 holds for all clusters
and let $p_n$ be a kernel density estimator with bandwidth $h$. Then there exist
constants $C_1, C_2$ such that if $h^2 \le C_1 \varepsilon$ we get with

\[ \Omega_{noisy} = \min\left\{ \frac{\rho_{\min}}{2\, 4^{d+1} \frac{p_{\max}}{t} + 4},\; \frac{n}{n-1} \frac{\delta}{8},\; \frac{n}{n-1} C_2 h^d \varepsilon^2 \right\} \]

and for sufficiently large $n$

\[ P\left( \text{All clusters roughly ident. in } G_{mut}(n, k, t - \varepsilon, \delta) \right) \ge 1 - (3m+5)\, e^{-(n-1)\, \Omega_{noisy}}. \]



We can see that as in the previous theorem, the constant which now governs
the speed of convergence is the worst case constant among all the $\rho^{(j)}$. In the
setting where we want to identify all clusters this is unavoidable. Of course
the identification of "insignificant" clusters will be difficult, and the overall
behavior will be determined by the most difficult case. This is what is
reflected in the above theorem. The corresponding theorem for identification of
all clusters in the symmetric kNN graph looks very similar, and we omit it.

So far for the noisy case we mainly considered the case of rough cluster iden- 
tification. As we have seen, in this setting the results of the noise-free case are 
very similar to the ones in the noisy case. Now we would like to conclude with 
a theorem for exact cluster identification in the noisy case. 

Theorem 4 (Exact identification of clusters in the noisy case) Let $p$
be three times continuously differentiable with uniformly bounded derivatives
and let $p_n$ be a kernel density estimator with bandwidth $h_n = h_0 (\log n / n)^{1/(d+4)}$
for some $h_0 > 0$. For a suitable constant $\varepsilon_0 > 0$ set $\varepsilon_n = \varepsilon_0 (\log n / n)^{2/(d+4)}$.
Then there exist constants $c_1, c_2$ such that for $n \to \infty$ and $c_1 \log n \le k \le c_2 n$
we obtain

Cluster $C^{(i)}$ is exactly identified in $G_{mut}(n, k, t - \varepsilon_n, \delta)$ almost surely.
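The two schedules in Theorem 4 are coupled: $\varepsilon_n$ shrinks like $h_n^2$, and $n h_n^d \varepsilon_n^2$ grows like $\log n$. A small Python sketch can make this visible. It is illustrative only; the function name `schedules` and the default constants $h_0 = \varepsilon_0 = 1$ are assumptions, not values from the paper.

```python
import math

def schedules(n, d, h0=1.0, eps0=1.0):
    """Bandwidth h_n and level offset eps_n as read from Theorem 4."""
    base = math.log(n) / n
    h_n = h0 * base ** (1.0 / (d + 4))
    eps_n = eps0 * base ** (2.0 / (d + 4))
    return h_n, eps_n

for n in (10**3, 10**4, 10**5):
    h_n, eps_n = schedules(n, d=2)
    # The last column equals h0^d * eps0^2 * log(n) up to rounding.
    print(n, h_n, eps_n, n * h_n ** 2 * eps_n ** 2)
```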



Note that as opposed to rough cluster identification, which is a statement
about a given finite nearest neighbor graph, exact cluster identification is an
inherently asymptotic property. The complication in this asymptotic setting
is that one has to balance the speed of convergence of the density estimator
with the one of the "convergence of the graph". The exact form of the density
estimation is not important. Every other density estimator with the same
convergence rate would yield the same result. One can even lower the assumptions
on the density to $p \in C^1(\mathbb{R}^d)$ (note that differentiability is elsewhere
required). Finally, note that since it is technically difficult to grasp the graph
after the small components have been discarded, we could not prove what the
optimal $k$ in this setting should be.



5 Proofs 



The propositions and lemmas containing the major proof steps are presented
in Section 5.1. The proofs of the theorems themselves can be found in Section
5.2. An overview of the proof structure can be seen in Figure 2.

[Figure 2: diagram linking Theorems 1 to 4 to the propositions and lemmas of Section 5.1.]

Fig. 2. The structure of our proofs. Proposition 1 deals with within-cluster
connectedness and Proposition 6 with between-cluster disconnectedness. Proposition 8
bounds the ratio of background and cluster points for the asymptotic analysis of
exact cluster identification.



5.1 Main propositions for cluster identification



In Proposition 1 we identify some events whose combination guarantees the
connectedness of a cluster in the graph and at the same time that there is not
a connected component of the graph that consists of background points only.
The probabilities of the events appearing in the proposition are then bounded
in Lemmas 2 to 5. In Proposition 6 and Lemma 7 we examine the probability of
connections between clusters. The section concludes with Proposition 8 and
Lemma 9, which are used in the exact cluster identification in Theorem 4, and
some remarks about the differences between the noise-free and the noisy case.

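Proposition 1 (Connectedness of one cluster $C^{(i)}$ in the noisy case)

Let $\mathcal{C}^{(i)}_n$ denote the event that in $G_{mut}(n, k, t - \varepsilon_n, \delta)$ (resp. $G_{sym}(n, k, t - \varepsilon_n, \delta)$)
it holds that

• all the sample points from $C^{(i)}$ are contained in the graph,

• the sample points from $C^{(i)}$ are connected in the graph,

• there exists no component of the graph which consists only of sample points
from outside $L(t)$.

Then under the conditions

(1) $\beta_{(i)} > 2\delta$,

(2) $\varepsilon_n$ sufficiently small such that $\mu\left( \bigcup_j \left( C^{(j)}_-(2\varepsilon_n) \setminus C^{(j)} \right) \right) < \delta/2$,

(3) $k \ge 4^{d+1} \frac{p^{(i)}_{\max}}{t} \log\left( 2\, 8^d\, p^{(i)}_{\max}\, \mathrm{vol}(C^{(i)})\, n \right)$,

    $k \le (n-1)\, 2\, 4^d\, \eta_d\, p^{(i)}_{\max} \min\left\{ (u^{(i)})^d, (\nu^{(i)}_{\max})^d \right\}$,

and for sufficiently large $n$, we obtain

\[ P\left( (\mathcal{C}^{(i)}_n)^c \right) \le P\left( (\mathcal{A}^{(i)}_n)^c \right) + P\left( (\mathcal{B}^{(i)}_n)^c \right) + P(\mathcal{E}_n^c) + P(\mathcal{D}_n^c) \le 2 e^{-\frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}}} + 2 e^{-n \frac{\delta}{8}} + 2 P(\mathcal{D}_n^c), \]

where the events are defined as follows:

• $\mathcal{A}^{(i)}_n$: the subgraph consisting of points from $C^{(i)}$ is connected in $G'_{mut}(n, k, t - \varepsilon_n)$
(resp. $G'_{sym}(n, k, t - \varepsilon_n)$),

• $\mathcal{B}^{(i)}_n$: there are more than $\delta n$ sample points from cluster $C^{(i)}$,

• $\mathcal{E}_n$: there are less than $\delta n$ sample points in the set $\bigcup_j \left( C^{(j)}_-(2\varepsilon_n) \setminus C^{(j)} \right)$, and

• $\mathcal{D}_n$: $|p_n(X_i) - p(X_i)| \le \varepsilon_n$ for all sample points $X_i$, $i = 1, \dots, n$.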

Proof. We bound the probability of $\mathcal{C}^{(i)}_n$ using the observation that
$\mathcal{A}^{(i)}_n \cap \mathcal{B}^{(i)}_n \cap \mathcal{E}_n \cap \mathcal{D}_n \subseteq \mathcal{C}^{(i)}_n$ implies

\[ P\left( (\mathcal{C}^{(i)}_n)^c \right) \le P\left( (\mathcal{A}^{(i)}_n)^c \right) + P\left( (\mathcal{B}^{(i)}_n)^c \right) + P(\mathcal{E}_n^c) + P(\mathcal{D}_n^c). \tag{1} \]

This follows from the following chain of observations. If the event $\mathcal{D}_n$ holds,
no point with $p(X_i) \ge t$ is removed, since on this event $p(X_i) - p_n(X_i) \le \varepsilon_n$
and thus $p_n(X_i) \ge p(X_i) - \varepsilon_n \ge t - \varepsilon_n$, which is the threshold in the graph
$G'(n, k, t - \varepsilon_n)$.

If the samples in cluster $C^{(i)}$ are connected in $G'(n, k, t - \varepsilon_n)$ ($\mathcal{A}^{(i)}_n$), and there
are more than $\delta n$ samples in cluster $C^{(i)}$ ($\mathcal{B}^{(i)}_n$), then the resulting component of
the graph $G'(n, k, t - \varepsilon_n)$ is not removed in the algorithm and is thus contained
in $G(n, k, t - \varepsilon_n, \delta)$.

Conditional on $\mathcal{D}_n$ all remaining samples are contained in $\bigcup_j C^{(j)}_-(2\varepsilon_n)$. Thus
all non-cluster samples lie in $\bigcup_j \left( C^{(j)}_-(2\varepsilon_n) \setminus C^{(j)} \right)$. Given that this set contains
less than $\delta n$ samples, there can exist no connected component only consisting
of non-cluster points, which implies that all remaining non-cluster points are
connected to one of the clusters.

The probabilities for the complements of the events $\mathcal{A}^{(i)}_n$, $\mathcal{B}^{(i)}_n$ and $\mathcal{E}_n$ are
bounded in Lemmas 3 to 5 below. Plugging those bounds into Equation (1)
leads to the desired result. □
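The proof refers to two pruning steps of the graph construction: points whose estimated density falls below $t - \varepsilon_n$ are dropped, and components with fewer than $\delta n$ points are removed afterwards. The following plain-Python sketch illustrates these steps for the mutual kNN graph on one-dimensional data. It is a minimal illustration under assumptions: the density estimates are passed in as given numbers rather than computed by a kernel estimator, and all function names and the toy data are hypothetical.

```python
def mutual_knn_edges(points, k):
    """Edges {i, j} with i among the kNN of j and j among the kNN of i (1-D data)."""
    n = len(points)
    neigh = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (abs(points[j] - points[i]), j))
        neigh.append(set(order[:k]))
    return {(i, j) for i in range(n) for j in neigh[i] if i < j and i in neigh[j]}

def connected_components(n, edges):
    """Connected components as sorted index lists."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps

def noisy_cluster_graph(points, density_est, k, t, eps, delta):
    """Components surviving both pruning steps, reported in original indices."""
    n = len(points)
    # Step 1: keep only points whose estimated density is at least t - eps.
    kept = [i for i in range(n) if density_est[i] >= t - eps]
    sub = [points[i] for i in kept]
    # Step 2: mutual kNN graph on the remaining points.
    comps = connected_components(len(sub), mutual_knn_edges(sub, k))
    # Step 3: discard components with fewer than delta * n points.
    return [[kept[v] for v in comp] for comp in comps if len(comp) >= delta * n]

# Two dense groups, one low-density point (index 4) and one stray point (index 9).
pts = [0.0, 0.1, 0.2, 0.3, 2.0, 5.0, 5.1, 5.2, 5.3, 9.0]
est = [1.0, 1.2, 1.2, 1.0, 0.2, 1.0, 1.2, 1.2, 1.0, 0.95]
print(noisy_cluster_graph(pts, est, k=2, t=1.0, eps=0.1, delta=0.2))
```

In this toy run index 4 is removed by the density threshold, while index 9 passes the threshold but ends up as a singleton component and is removed by the size rule.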

We make frequent use of the following tail bounds for the binomial distribution 
introduced by Hoeffding. 

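Theorem 5 (Hoeffding, [9]) Let $M \sim \mathrm{Bin}(n, p)$ and define $\alpha = k/n$. Then,

\[ \alpha \ge p: \quad P(M \ge k) \le e^{-n K(\alpha \| p)}, \]
\[ \alpha \le p: \quad P(M \le k) \le e^{-n K(\alpha \| p)}, \]

where $K(\alpha \| p)$ is the Kullback-Leibler divergence of $(\alpha, 1 - \alpha)$ and $(p, 1 - p)$,

\[ K(\alpha \| p) = \alpha \log\left( \frac{\alpha}{p} \right) + (1 - \alpha) \log\left( \frac{1 - \alpha}{1 - p} \right). \]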
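As a quick numerical sanity check of this tail bound, the sketch below compares the exact upper tail of a binomial distribution with the Kullback-Leibler bound. It is illustrative only, uses the Python standard library, and the helper names and the chosen numbers are hypothetical.

```python
import math

def kl_bernoulli(a, p):
    """K(a || p) for a, p strictly between 0 and 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def upper_tail_exact(n, p, k):
    """Exact P(M >= k) for M ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def upper_tail_bound(n, p, k):
    """Bound exp(-n K(k/n || p)), valid for k/n >= p."""
    return math.exp(-n * kl_bernoulli(k / n, p))

n, p, k = 100, 0.1, 20
print(upper_tail_exact(n, p, k), upper_tail_bound(n, p, k))
```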

In the following lemmas we derive bounds for the probabilities of the events 
introduced in the proposition above. 

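Lemma 2 (Within-cluster connectedness ($\mathcal{A}^{(i)}_n$)) As in Proposition 1 let
$\mathcal{A}^{(i)}_n$ denote the event that the points of cluster $C^{(i)}$ are connected in
$G'_{mut}(n, k, \varepsilon_n)$ (resp. $G'_{sym}(n, k, \varepsilon_n)$). For $z \in (0, 4 \min\{u^{(i)}, \nu^{(i)}_{\max}\})$,

\[ P\left( (\mathcal{A}^{(i)}_n)^c \right) \le n \beta_{(i)} P(M \ge k) + N \left( 1 - t \eta_d \frac{z^d}{4^d} \right)^n + P(\mathcal{D}_n^c), \]

where $M$ is a $\mathrm{Bin}(n-1, p^{(i)}_{\max} \eta_d z^d)$-distributed random variable and
$N \le (8^d \mathrm{vol}(C^{(i)})) / (z^d \eta_d)$.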

Proof. Given that $\mathcal{D}_n$ holds, all samples lying in cluster $C^{(i)}$ are contained in
the graph $G'(n, k, \varepsilon_n)$. Suppose that we have a covering of $C^{(i)} \setminus \mathrm{Col}^{(i)}(z/4)$
with balls of radius $z/4$. By construction every ball of the covering lies entirely
in $C^{(i)}$, so that $t$ is a lower bound for the minimal density in each ball. If
every ball of the covering contains at least one sample point and the minimal
kNN radius of samples in $C^{(i)}$ is larger than $z$, then all samples of
$C^{(i)} \setminus \mathrm{Col}^{(i)}(z/4)$ are connected in $G'(n, k, \varepsilon_n)$ given that $z \le 4 \nu^{(i)}_{\max}$. Moreover,
one can easily check that all samples lying in the collar set $\mathrm{Col}^{(i)}(z/4)$ are
connected to $C^{(i)} \setminus \mathrm{Col}^{(i)}(z/4)$. In total, all sample points lying in $C^{(i)}$
are then connected. Denote by $\mathcal{F}^{(i)}$ the event that one ball in the covering with
balls of radius $z/4$ contains no sample point. Formally, $\{R^{(i)}_{\min} > z\} \cap (\mathcal{F}^{(i)})^c$
implies connectedness of the samples lying in $C^{(i)}$ in the graph $G'(n, k, \varepsilon_n)$.

Define $N_s = |\{ j \ne s \mid X_j \in B(X_s, z) \}|$ for $1 \le s \le n$. Then
$\{R^{(i)}_{\min} \le z\} = \bigcup_{s=1}^n \{ \{N_s \ge k\} \cap \{X_s \in C^{(i)}\} \}$. We have

\[ P\left( R^{(i)}_{\min} \le z \right) \le \sum_{s=1}^n P\left( N_s \ge k \mid X_s \in C^{(i)} \right) P\left( X_s \in C^{(i)} \right) \le n \beta_{(i)} P(U \ge k), \]

where $U \sim \mathrm{Bin}(n-1, \sup_{x \in C^{(i)}} \mu(B(x, z)))$. The final result is obtained using
the upper bound $\sup_{x \in C^{(i)}} \mu(B(x, z)) \le p^{(i)}_{\max} \eta_d z^d$.

For the covering, a standard construction using a $z/4$-packing provides us with
the covering. Since $z/4 \le \nu^{(i)}_{\max}$ we know that balls of radius $z/8$ around the
packing centers are subsets of $C^{(i)}$ and disjoint by construction. Thus, the
total volume of the $N$ balls is bounded by the volume of $C^{(i)}$ and we get
$N (z/8)^d \eta_d \le \mathrm{vol}(C^{(i)})$. Since we assume that $\mathcal{D}_n$ holds, no sample lying in
$C^{(i)}$ has been discarded. Thus the probability for one ball of the covering
being empty can be upper bounded by $(1 - t \eta_d z^d / 4^d)^n$, where we have used
that the balls of the covering are entirely contained in $C^{(i)}$ and thus the density
is lower bounded by $t$. In total, a union bound over all balls in the covering
yields,

\[ P(\mathcal{F}^{(i)}) \le N \left( 1 - t \eta_d z^d / 4^d \right)^n + P(\mathcal{D}_n^c). \]

Plugging both results together yields the final result. □

In Lemma 2 we provided a bound on the probability which includes two competing
terms for the choice of $z$. One favors small $z$ whereas the other favors
large $z$. The next lemma will provide a trade-off optimal choice of the radius
$z$ in terms of $k$.

Lemma 3 (Choice of k for within-cluster connectedness ($\mathcal{A}^{(i)}_n$)) If $k$
fulfills Condition (3) of Proposition 1, we have for sufficiently large $n$

\[ P\left( (\mathcal{A}^{(i)}_n)^c \right) \le 2 e^{-\frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}}} + P(\mathcal{D}_n^c). \]



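Proof. The upper bound on the probability of $(\mathcal{A}^{(i)}_n)^c$ given in Lemma 2 has
two terms dependent on $z$. The tail bound for the binomial distribution is
small if $z$ is chosen to be small, whereas the term from the covering is small
given that $z$ is large. Here, we find a choice for $z$ which is close to optimal.
Define $p = p^{(i)}_{\max} \eta_d z^d$ and $\alpha = k/(n-1)$. Using Theorem 5 we obtain for
$M \sim \mathrm{Bin}(n-1, p)$ and a choice of $z$ such that $p < \alpha$,

\[ n \beta_{(i)} P(M \ge k) \le n \beta_{(i)}\, e^{-(n-1)\left( \alpha \log\left( \frac{\alpha}{p} \right) + p - \alpha \right)}, \]

where we have used $\log(z) \ge (z-1)/z$ for $z > 0$. Now, introduce $\theta := \eta_d z^d / \alpha$
so that $p = p^{(i)}_{\max} \theta \alpha$, where with $p < \alpha$ we get $0 < \theta p^{(i)}_{\max} < 1$. Then,

\[ n \beta_{(i)} P(M \ge k) \le n \beta_{(i)}\, e^{-k \left( \log\left( \frac{1}{p^{(i)}_{\max} \theta} \right) + \theta p^{(i)}_{\max} - 1 \right)} \le e^{-\frac{k}{2} \left( \log\left( \frac{1}{p^{(i)}_{\max} \theta} \right) + \theta p^{(i)}_{\max} - 1 \right)}, \tag{2} \]

where we used in the last step an upper bound on the term $n \beta_{(i)}$ which holds
given $k \ge (2 \log(\beta_{(i)} n)) / (\log(1/(\theta p^{(i)}_{\max})) + \theta p^{(i)}_{\max} - 1)$. On the other hand,

\[ N \left( 1 - t \eta_d z^d / 4^d \right)^n = N e^{n \log(1 - t \eta_d z^d / 4^d)} \le N e^{-n t \eta_d z^d / 4^d}, \]

where we used $\log(1 - x) \le -x$ for $x < 1$. With $\eta_d z^d = \theta \alpha$ and the upper
bound on $N$ we get using $n/(n-1) \ge 1$,

\[ N e^{-n t \theta \alpha / 4^d} \le \frac{8^d \mathrm{vol}(C^{(i)})}{\theta \alpha}\, e^{-k \frac{t \theta}{4^d}} \le e^{-k \frac{t \theta}{2\, 4^d}}, \tag{3} \]

where the last step holds given $k \ge \frac{2\, 4^d}{t \theta} \log\left( \frac{n\, \mathrm{vol}(C^{(i)})\, 8^d}{\theta} \right)$. Upper bounding the
bound in (2) with the one in (3) requires,

\[ \frac{t \theta}{2\, 4^d} \le \frac{1}{2} \left( \log\left( \frac{1}{\theta p^{(i)}_{\max}} \right) + \theta p^{(i)}_{\max} - 1 \right). \]

Introduce $\gamma = \theta p^{(i)}_{\max}$, then this is equivalent to $\gamma t / (4^d p^{(i)}_{\max}) \le (-\log(\gamma) + \gamma - 1)$.
Note that $t / (4^d p^{(i)}_{\max}) \le 1/4$. Thus, the above inequality holds for all
$d \ge 1$ given that $-\log(\gamma) \ge 1 - 3\gamma/4$. A simple choice is $\gamma = 1/2$ and thus
$\theta = 1/(2 p^{(i)}_{\max})$, which fulfills $\theta p^{(i)}_{\max} < 1$. In total, we obtain with the result
from Lemma 2,

\[ P\left( (\mathcal{A}^{(i)}_n)^c \right) \le 2 e^{-\frac{k}{4^{d+1}} \frac{t}{p^{(i)}_{\max}}} + P(\mathcal{D}_n^c) \le 2 e^{-\frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}}} + P(\mathcal{D}_n^c). \]

We plug in the choice of $\theta$ into the lower bounds on $k$. One can easily find an
upper bound for the maximum of the two lower bounds which gives,

\[ k \ge 4^{d+1} \frac{p^{(i)}_{\max}}{t} \log\left( 2\, 8^d\, p^{(i)}_{\max}\, \mathrm{vol}(C^{(i)})\, n \right). \]

The upper bound, $z \le 4 \min\{u^{(i)}, \nu^{(i)}_{\max}\}$, translates into the following upper
bound on $k$, $k \le (n-1)\, 2\, 4^d\, \eta_d\, p^{(i)}_{\max} \min\left\{ (u^{(i)})^d, (\nu^{(i)}_{\max})^d \right\}$. □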

The result of this lemma means that if we choose $k \ge c_1 + c_2 \log n$ with two
constants $c_1, c_2$ that depend on the geometry of the cluster and the respective
density, then the probability that the cluster is disconnected approaches zero
exponentially in $k$.

Note that due to the constraints on the covering radius, we have to introduce
an upper bound on $k$ which depends linearly on $n$. However, as the probability
of connectedness is monotonically increasing in $k$, the value of the within-cluster
connectedness bound for this value of $k$ is a lower bound for all larger $k$ as
well. Since the lower bound on $k$ grows with $\log n$ and the upper bound grows
with $n$, there exists a feasible region for $k$ if $n$ is large enough.
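The existence of a feasible region can be checked numerically. The Python sketch below evaluates the two bounds of Condition (3) of Proposition 1 as read above; it is an illustration under assumptions, with hypothetical function names and made-up cluster parameters (a one-dimensional cluster of unit length with uniform density).

```python
import math

def unit_ball_volume(d):
    """Volume eta_d of the d-dimensional unit ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def k_range(n, d, t, p_max, vol_c, u, nu_max):
    """Lower and upper bound on k from Condition (3) of Proposition 1."""
    lower = 4 ** (d + 1) * (p_max / t) * math.log(2 * 8 ** d * p_max * vol_c * n)
    upper = (n - 1) * 2 * 4 ** d * unit_ball_volume(d) * p_max * min(u, nu_max) ** d
    return lower, upper

# Made-up parameters: the range is empty for small n and non-empty for large n.
for n in (100, 1000, 10000):
    lower, upper = k_range(n, d=1, t=1.0, p_max=1.0, vol_c=1.0, u=0.05, nu_max=0.05)
    print(n, lower, upper, lower <= upper)
```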

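Lemma 4 (Event $\mathcal{B}^{(i)}_n$) As in Proposition 1 let $\mathcal{B}^{(i)}_n$ denote the event that
there are more than $\delta n$ sample points from cluster $C^{(i)}$. If $\beta_{(i)} > \delta$ then

\[ P\left( (\mathcal{B}^{(i)}_n)^c \right) \le \exp\left( -\frac{1}{2} n \beta_{(i)} \left( \frac{\beta_{(i)} - \delta}{\beta_{(i)}} \right)^2 \right). \]

Proof. Let $M^{(i)}$ be the number of samples in cluster $C^{(i)}$. Then,

\[ P\left( M^{(i)} < \delta n \right) \le P\left( M^{(i)} \le \frac{\delta}{\beta_{(i)}} \beta_{(i)} n \right) \le \exp\left( -\frac{1}{2} n \beta_{(i)} \left( \frac{\beta_{(i)} - \delta}{\beta_{(i)}} \right)^2 \right), \]

where we used $M^{(i)} \sim \mathrm{Bin}(n, \beta_{(i)})$ and a Chernoff bound. □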



Lemma 5 (Event $\mathcal{E}_n$) As in Proposition 1 let $\mathcal{E}_n$ denote the event that there
are less than $\delta n$ sample points in all the boundary sets $C^{(j)}_-(2\varepsilon_n) \setminus C^{(j)}$ together.
If $\sum_{j=1}^m \mu\left( C^{(j)}_-(2\varepsilon_n) \setminus C^{(j)} \right) < \delta/2$, we have $P(\mathcal{E}_n^c) \le \exp(-\delta n/8)$.

Proof. By assumption, for the probability mass in the boundary strips we have
$\sum_{j=1}^m \mu\left( C^{(j)}_-(2\varepsilon_n) \setminus C^{(j)} \right) < \delta/2$. Then the probability that there are at least
$\delta n$ points in the boundary strips can be bounded by the probability that a
$\mathrm{Bin}(n, \delta/2)$-distributed random variable $V$ exceeds $\delta n$. Using a Chernoff bound
we obtain $P(V > \delta n) \le \exp(-\delta n/8)$. □

The proposition and the lemmas above are used in the analysis of within- 
cluster connectedness. The following proposition deals with between-cluster 
disconnectedness. 

We say that a cluster $C^{(i)}$ is isolated if the subgraph of $G_{mut}(n, k, t - \varepsilon_n, \delta)$
(resp. $G_{sym}(n, k, t - \varepsilon_n, \delta)$) corresponding to cluster $C^{(i)}$ is not connected to
another subgraph corresponding to any other cluster $C^{(j)}$ with $j \ne i$. Note
that we assume $\min_{j=1,\dots,m,\, j \ne i} \mathrm{dist}(C^{(i)}_-(2\varepsilon_n), C^{(j)}_-(2\varepsilon_n)) \ge u^{(i)}$ for all $\varepsilon_n \le \tilde{\varepsilon}$.
The following proposition bounds the probability for cluster isolation. This
bound involves the probability that the maximal k-nearest-neighbor radius is
greater than some threshold. Therefore in Lemma 7 we derive a bound for this
probability. Note that our previous paper [UJ contained an error in the result
corresponding to Lemma 7, which changed some constants but did not affect
the main results.

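Proposition 6 (Cluster isolation) Let $\mathcal{I}^{(i)}_n$ denote the event that the subgraph
of the samples in $C^{(i)}_-(2\varepsilon_n)$ is isolated in $G_{mut}(n, k, t - \varepsilon_n, \delta)$. Then given
that $\varepsilon_n \le \tilde{\varepsilon}$, $k < \rho^{(i)} n/2 - 2 \log(\tilde{\beta}_{(i)} n)$, we obtain

\[ P\left( (\mathcal{I}^{(i)}_n)^c \right) \le P\left( \tilde{R}^{(i)}_{\max} > u^{(i)} \right) + P(\mathcal{D}_n^c) \le e^{-\frac{n-1}{2} \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right)} + P(\mathcal{D}_n^c). \]

Let $\hat{\mathcal{I}}^{(i)}_n$ be the event that the subgraph of samples in $C^{(i)}_-(2\varepsilon_n)$ is isolated in
$G_{sym}(n, k, t - \varepsilon_n, \delta)$. Define $\rho_{\min} = \min_{i=1,\dots,m} \rho^{(i)}$ and $\tilde{\beta}_{\max} = \max_{i=1,\dots,m} \tilde{\beta}_{(i)}$.
Then for $\varepsilon_n \le \tilde{\varepsilon}$, $k < \rho_{\min} n/2 - 2 \log(\tilde{\beta}_{\max} n)$, we obtain

\[ P\left( (\hat{\mathcal{I}}^{(i)}_n)^c \right) \le \sum_{j=1}^m P\left( \tilde{R}^{(j)}_{\max} > u^{(j)} \right) + P(\mathcal{D}_n^c) \le m\, e^{-\frac{n-1}{2} \left( \frac{\rho_{\min}}{2} - \frac{k-1}{n-1} \right)} + P(\mathcal{D}_n^c). \]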

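Proof. We have $P\left( (\mathcal{I}^{(i)}_n)^c \right) \le P\left( (\mathcal{I}^{(i)}_n)^c \mid \mathcal{D}_n \right) + P(\mathcal{D}_n^c)$. Given the
event $\mathcal{D}_n$, the remaining points in $G_{mut}(n, k, t - \varepsilon_n, \delta)$ are samples from
$C^{(j)}_-(2\varepsilon_n)$ ($j = 1, \dots, m$). By assumption we have for $\varepsilon_n \le \tilde{\varepsilon}$ that
$\min_{j \ne i} \mathrm{dist}(C^{(i)}_-(2\varepsilon_n), C^{(j)}_-(2\varepsilon_n)) \ge u^{(i)}$. In order to have edges from samples
in $C^{(i)}_-(2\varepsilon_n)$ to any other part in $G_{mut}(n, k, t - \varepsilon_n, \delta)$, it is necessary that
$\tilde{R}^{(i)}_{\max} > u^{(i)}$. Using Lemma 7 we can bound the probability of this
event. For the symmetric kNN graph there can be additional edges from samples
in $C^{(i)}_-(2\varepsilon_n)$ to other parts in the graph if samples lying in $C^{(i)}_-(2\varepsilon_n)$
are among the kNN-neighbors of samples in $C^{(j)}_-(2\varepsilon_n)$, $j \ne i$. Let $u^{ij}$ be the
distance between $C^{(i)}_-(2\varepsilon)$ and $C^{(j)}_-(2\varepsilon)$. There can be edges from samples in
$C^{(i)}_-(2\varepsilon_n)$ to any other part in $G_{sym}(n, k, \varepsilon_n, \delta)$ if the following event holds:
$\{\tilde{R}^{(i)}_{\max} > u^{(i)}\} \cup \{\bigcup_{j \ne i} \{\tilde{R}^{(j)}_{\max} > u^{ij}\}\}$. Using a union bound we obtain,

\[ P\left( (\hat{\mathcal{I}}^{(i)}_n)^c \mid \mathcal{D}_n \right) \le P\left( \tilde{R}^{(i)}_{\max} > u^{(i)} \right) + \sum_{j \ne i} P\left( \tilde{R}^{(j)}_{\max} > u^{ij} \right). \]

With $u^{(j)} \le u^{ij}$ and Lemma 7 we obtain the result for $G_{sym}(n, k, \varepsilon_n, \delta)$. □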

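The following lemma states the upper bound for the probability that the
maximum k-nearest neighbor radius $\tilde{R}^{(i)}_{\max}$ of samples in $C^{(i)}_-(2\varepsilon_n)$ exceeds $u^{(i)}$,
which was used in the proof of Proposition 6.

Lemma 7 (Maximal kNN radius) Let $k < \rho^{(i)} n/2 - 2 \log(\tilde{\beta}_{(i)} n)$. Then

\[ P\left( \tilde{R}^{(i)}_{\max} > u^{(i)} \right) \le e^{-\frac{n-1}{2} \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right)}. \]

Proof. Define $N_s = |\{ j \ne s \mid X_j \in B(X_s, u^{(i)}) \}|$ for $1 \le s \le n$. Then
$\{\tilde{R}^{(i)}_{\max} > u^{(i)}\} = \bigcup_{s=1}^n \{ N_s \le k - 1 \cap X_s \in C^{(i)}_-(2\varepsilon) \}$. Thus,

\[ P\left( \tilde{R}^{(i)}_{\max} > u^{(i)} \right) \le \sum_{s=1}^n P\left( N_s \le k - 1 \mid X_s \in C^{(i)}_-(2\varepsilon) \right) P\left( X_s \in C^{(i)}_-(2\varepsilon) \right). \]

Let $M \sim \mathrm{Bin}(n-1, \rho^{(i)})$. Then $P(N_s \le k - 1 \mid X_s \in C^{(i)}_-(2\varepsilon)) \le P(M \le k - 1)$.
Using the tail bound from Theorem 5 we obtain for $k - 1 < \rho^{(i)}(n-1)$,

\[ P\left( \tilde{R}^{(i)}_{\max} > u^{(i)} \right) \le n \tilde{\beta}_{(i)} P(M \le k - 1) \le n \tilde{\beta}_{(i)}\, e^{-(n-1) \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right)} \le e^{-\frac{n-1}{2} \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right)}, \]

where we use that $\log(x) \ge (x-1)/x$, that $-w/e$ is the minimum of $x \log(x/w)$
attained at $x = w/e$ and $(1 - 1/e) \ge 1/2$. Finally, we use that under the stated
condition on $k$ we have $\log(n \tilde{\beta}_{(i)}) \le [(n-1)\rho^{(i)}/2 - (k-1)]/2$.

□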



The following proposition quantifies the rate of exact cluster identification, 
that means how fast the fraction of points from outside the level set L(t) 
approaches zero. 

Proposition 8 (Ratio of boundary and cluster points) Let $N_{Cluster}$
and $N_{NoCluster}$ be the number of cluster points and background points in
$G_{mut}(n, k, t - \varepsilon_n, \delta)$ (resp. $G_{sym}(n, k, t - \varepsilon_n, \delta)$) and let $\mathcal{C}^{all}_n$ denote the event
that the points of each cluster form a connected component of the graph. Let
$\varepsilon_n \to 0$ for $n \to \infty$ and define $\beta = \sum_{i=1}^m \beta_{(i)}$. Then there exists a constant
$D > 0$ such that for sufficiently large $n$,

\[ P\left( N_{NoCluster} / N_{Cluster} > 4 \frac{D}{\beta} \varepsilon_n \;\middle|\; \mathcal{C}^{all}_n \right) \le e^{-\frac{1}{4} D \varepsilon_n n} + e^{-n \frac{\beta}{8}} + P(\mathcal{D}_n^c). \]

Proof. According to Lemma 9 we can find constants $D^{(i)} > 0$ such that
$\mu\left( C^{(i)}_-(2\varepsilon_n) \setminus C^{(i)} \right) \le D^{(i)} \varepsilon_n$ for $n$ sufficiently large, and set $D = \sum_{i=1}^m D^{(i)}$.
Suppose that $\mathcal{D}_n$ holds. Then the only points which do not belong to a cluster
lie in the set $\bigcup_i \left( C^{(i)}_-(2\varepsilon_n) \setminus C^{(i)} \right)$. Some of them might be discarded, but
since we are interested in proving an upper bound on $N_{NoCluster}$ that does not
matter. Then with $p = \mathbb{E} N_{NoCluster}/n \le D \varepsilon_n$ and $\alpha = 2 D \varepsilon_n$ we obtain with
Theorem 5 and for sufficiently small $\varepsilon_n$,

\[ P\left( N_{NoCluster} \ge 2 D \varepsilon_n n \mid \mathcal{C}^{all}_n, \mathcal{D}_n \right) \le e^{-n K(\alpha \| p)} \le e^{-n \frac{D \varepsilon_n}{4}}, \]

where $K$ denotes the Kullback-Leibler divergence. Here we used that for $p \le D \varepsilon_n$
we have $K(\alpha \| p) \ge K(\alpha \| D \varepsilon_n)$ and with $\log(1 + x) \ge x/(1 + x)$ for $x > -1$
we have $K(2 D \varepsilon_n \| D \varepsilon_n) \ge D \varepsilon_n (2 \log 2 - 1) \ge D \varepsilon_n / 4$. Given that $\mathcal{D}_n$ holds and
the points of each cluster are a connected component of the graph, we know
that all cluster points remain in the graph and we have

\[ P\left( N_{Cluster} \le \frac{\beta n}{2} \;\middle|\; \mathcal{C}^{all}_n, \mathcal{D}_n \right) \le e^{-n \frac{\beta}{8}} \]

using Theorem 5 and similar arguments to above. □

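Lemma 9 Assume that $p \in C^2(\mathbb{R}^d)$ with $\|p\|_\infty = p_{\max}$ and that for each $x$ in
a neighborhood of $\{p = t\}$ the gradient of $p$ at $x$ is non-zero. Then there exists
a constant $D^{(i)} > 0$ such that for $\varepsilon_n$ sufficiently small,

\[ \mu\left( C^{(i)}_-(2\varepsilon_n) \setminus C^{(i)} \right) \le D^{(i)} \varepsilon_n. \]

Proof. Under the conditions on the gradient and $\varepsilon_n$ small enough, one has
$C^{(i)}_-(2\varepsilon_n) \subseteq C^{(i)} + C_1 \varepsilon_n B(0, 1)$ for some constant $C_1$. Here "+" denotes set
addition, that is for sets $A$ and $B$ we define $A + B = \{a + b \mid a \in A,\, b \in B\}$.
Since the boundary $\partial C^{(i)}$ is a smooth $(d-1)$-dimensional submanifold in $\mathbb{R}^d$
with a minimal curvature radius $\kappa^{(i)} > 0$, there exists $\gamma_1 > 0$ and a constant
$C_2$ such that $\mathrm{vol}(C^{(i)} + \varepsilon_n B(0, 1)) \le \mathrm{vol}(C^{(i)}) + C_2 \varepsilon_n \mathrm{vol}(\partial C^{(i)})$ for $\varepsilon_n < \gamma_1$
(see Theorem 3.3.39 in [7]). Thus, by the additivity of the volume,

\[ \mathrm{vol}\left( C^{(i)}_-(2\varepsilon_n) \setminus C^{(i)} \right) \le \mathrm{vol}\left( C^{(i)} + C_1 \varepsilon_n B(0, 1) \right) - \mathrm{vol}\left( C^{(i)} \right) \le C_1 C_2\, \mathrm{vol}(\partial C^{(i)})\, \varepsilon_n. \]

Since $p$ is bounded, we obtain $\mu\left( C^{(i)}_-(2\varepsilon_n) \setminus C^{(i)} \right) \le C_1 C_2\, \mathrm{vol}(\partial C^{(i)})\, p_{\max}\, \varepsilon_n$
for $\varepsilon_n$ small enough. Setting $D^{(i)} = C_1 C_2\, \mathrm{vol}(\partial C^{(i)})\, p_{\max}$ the result follows. □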

Noise-free case as special case of the noisy one. In the noise-free case, 
by definition all sample points belong to a cluster. That means 

• we can omit the density estimation step, which was used to remove background
points from the graph, and drop the event $\mathcal{D}_n$ everywhere,

• we work with $L(t)$ directly instead of $L(t - \varepsilon)$,

• we do not need to remove the small components of size smaller than $\delta n$,
which was needed to get a grip on the "boundary" $L(t - \varepsilon) \setminus L(t)$.

In particular, setting $\delta = 0$ we trivially have $P\left( (\mathcal{B}^{(i)}_n)^c \right) = 0$ and $P(\mathcal{E}_n^c) = 0$ for
all $i = 1, \dots, m$ and all $n \in \mathbb{N}$.

As a consequence, we can directly work on the graphs G mut (n, k) and G sym (n, k), 
respectively. Therefore, the bounds we gave in the previous sections also hold 
in the simpler noise-free case and can be simplified in this setting. 



5.2 Proofs of the main theorems 

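Proof of Theorem 1. Given that the event $\mathcal{I}^{(i)}_n$ of Proposition 6 holds, there are
no connections in $G_{mut}(n, k, t - \varepsilon, \delta)$ between the subgraph containing the points
of cluster $C^{(i)}$ and points from any other cluster.
Moreover, by Proposition 1 we know that the event
$\mathcal{A}^{(i)}_n \cap \mathcal{B}^{(i)}_n \cap \mathcal{E}_n \cap \mathcal{D}_n \subseteq \mathcal{C}^{(i)}_n$
implies that the subgraph of all the sample points lying in cluster $C^{(i)}$ is
connected and all other sample points not lying in the cluster $C^{(i)}$ are either
discarded or are connected to the subgraph containing all cluster points. That
means we have identified cluster $C^{(i)}$. Collecting the bounds from Propositions
6 and 1, we obtain

\[ \begin{aligned}
P\left( \text{Cluster } C^{(i)} \text{ not roughly identified in } G_{mut}(n, k, t - \varepsilon, \delta) \right)
&\le P\left( (\mathcal{I}^{(i)}_n)^c \right) + P\left( (\mathcal{C}^{(i)}_n)^c \right) \\
&\le P\left( (\mathcal{I}^{(i)}_n)^c \right) + P\left( (\mathcal{A}^{(i)}_n)^c \right) + P\left( (\mathcal{B}^{(i)}_n)^c \right) + P(\mathcal{E}_n^c) + P(\mathcal{D}_n^c) \\
&\le e^{-\frac{n-1}{2} \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right)} + 2 e^{-\frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}}} + 2 e^{-n \frac{\delta}{8}} + 3 P(\mathcal{D}_n^c).
\end{aligned} \]

In the noise-free case the events $\mathcal{B}^{(i)}_n$, $\mathcal{E}_n$ and $\mathcal{D}_n$ can be ignored. The optimal
choice for $k$ follows by equating the exponents of the bounds for $(\mathcal{I}^{(i)}_n)^c$ and
$(\mathcal{A}^{(i)}_n)^c$ and solving for $k$. One gets for the optimal $k$,

\[ k = (n-1) \frac{\rho^{(i)}}{2 + \frac{1}{4^d} \frac{t}{p^{(i)}_{\max}}} + 1, \quad \text{and a rate of} \quad (n-1) \frac{\rho^{(i)}}{2\, 4^{d+1} \frac{p^{(i)}_{\max}}{t} + 4}. \]

In the noisy case, we know that for $n$ sufficiently large we can take $\varepsilon$ small
enough ($\varepsilon$ is small and fixed) such that the condition
$\sum_{j=1}^m \mu\left( C^{(j)}_-(2\varepsilon) \setminus C^{(j)} \right) < \delta/2$ holds. It is well known that under our
conditions on $p$ there exist constants $C_1, C_2$ such that $P(\mathcal{D}_n^c) \le e^{-C_2 n h^d \varepsilon^2}$
given $h^2 \le C_1 \varepsilon$ (cf. Rao [13]). Plugging this result into the bounds above, the
rate of convergence is determined by the worst exponent,

\[ \min\left\{ \frac{(n-1)\rho^{(i)}}{4} - \frac{k-1}{2},\; \frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}},\; n \frac{\delta}{8},\; C_2 n h^d \varepsilon^2 \right\}. \]

However, since the other bounds do not depend on $k$ the optimal choice for $k$
remains the same. □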
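For completeness, the step "equating the exponents and solving for $k$" can be written out as follows. This is a worked version of the computation using the exponents as read from Proposition 6 and Lemma 3; it adds no new assumptions beyond those bounds.

\[ \frac{n-1}{2} \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right) = \frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}} \;\Longleftrightarrow\; \frac{(n-1)\rho^{(i)}}{4} = (k-1) \left( \frac{1}{2} + \frac{1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}} \right) \;\Longleftrightarrow\; k - 1 = (n-1) \frac{\rho^{(i)}}{2 + \frac{1}{4^d} \frac{t}{p^{(i)}_{\max}}}. \]

Inserting this $k$ into either exponent gives the common value

\[ \frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}} = (n-1) \frac{\rho^{(i)}}{2\, 4^{d+1} \frac{p^{(i)}_{\max}}{t} + 4}, \]

which is $(n-1)$ times the noise-free rate of Theorem 1. Since one exponent decreases and the other increases in $k$, the minimum of the two is maximized exactly at this crossing point.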



Proof of Theorem 2. Compared to the proof for cluster identification in the
mutual kNN graph in Theorem 1 the only part which changes is the connectivity
event. Here we have to replace the bound on $P\left( (\mathcal{I}^{(i)}_n)^c \right)$ by the bound on
$P\left( (\hat{\mathcal{I}}^{(i)}_n)^c \right)$ from Proposition 6. With $\rho_{\min} = \min_{i=1,\dots,m} \rho^{(i)}$ we obtain

\[ P\left( (\hat{\mathcal{I}}^{(i)}_n)^c \right) \le m\, e^{-\frac{n-1}{2} \left( \frac{\rho_{\min}}{2} - \frac{k-1}{n-1} \right)} + P(\mathcal{D}_n^c). \]

Following the same procedure as in the proof of Theorem 1 provides the result
(for both the noise-free and the noisy case). □

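Proof of Theorem 3. We set $\mathcal{C}^{all}_n = \bigcap_{i=1}^m \mathcal{C}^{(i)}_n$ and $\mathcal{I}^{all}_n = \bigcap_{i=1}^m \mathcal{I}^{(i)}_n$. By a slight
modification of the proof of Proposition 1 and $p_{\max} = \max_{i=1,\dots,m} p^{(i)}_{\max}$,

\[ P\left( (\mathcal{C}^{all}_n)^c \right) \le 2 \sum_{i=1}^m e^{-\frac{k-1}{4^{d+1}} \frac{t}{p^{(i)}_{\max}}} + 2 e^{-n \frac{\delta}{8}} + 2 P(\mathcal{D}_n^c) \le 2m\, e^{-\frac{k-1}{4^{d+1}} \frac{t}{p_{\max}}} + 2 e^{-n \frac{\delta}{8}} + 2 P(\mathcal{D}_n^c). \]

By a slight modification of the proof of Proposition 6 with $\rho_{\min} = \min_{i=1,\dots,m} \rho^{(i)}$,

\[ P\left( (\mathcal{I}^{all}_n)^c \right) \le \sum_{i=1}^m e^{-\frac{n-1}{2} \left( \frac{\rho^{(i)}}{2} - \frac{k-1}{n-1} \right)} + P(\mathcal{D}_n^c) \le m\, e^{-\frac{n-1}{2} \left( \frac{\rho_{\min}}{2} - \frac{k-1}{n-1} \right)} + P(\mathcal{D}_n^c). \]

Combining these results we obtain

\[ P\left( \text{Not all clusters } C^{(i)} \text{ roughly identified in } G_{mut}(n, k, t - \varepsilon, \delta) \right) \le m\, e^{-\frac{n-1}{2} \left( \frac{\rho_{\min}}{2} - \frac{k-1}{n-1} \right)} + 3 P(\mathcal{D}_n^c) + 2m\, e^{-\frac{k-1}{4^{d+1}} \frac{t}{p_{\max}}} + 2 e^{-n \frac{\delta}{8}}. \]

The result follows with a similar argumentation to the proof of Theorem 1. □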

Proof of Theorem 4. Clearly we can choose $\varepsilon_0 > 0$ such that $h_n^2 \le C \varepsilon_n$ for a
suitable constant $C > 0$. Then there exists a constant $C_2 > 0$ with
$P(\mathcal{D}_n^c) \le e^{-C_2 n h_n^d \varepsilon_n^2}$. Since

\[ n h_n^d \varepsilon_n^2 = h_0^d \varepsilon_0^2\, n \left( \frac{\log n}{n} \right)^{\frac{d}{d+4}} \left( \frac{\log n}{n} \right)^{\frac{4}{d+4}} = h_0^d \varepsilon_0^2 \log n, \]

we have $\sum_{n=1}^\infty P(\mathcal{D}_n^c) < \infty$. Moreover, let $\mathcal{C}^{all}_n$ denote the event that the points
of each cluster form a connected component of the graph. Then it can be
easily checked with Proposition 8 that we have
$\sum_{n=1}^\infty P\left( N_{NoCluster}/N_{Cluster} > 4 D \varepsilon_n / \beta \mid \mathcal{C}^{all}_n \right) < \infty$. Moreover, similar to the proof of Theorem 3 one can show
that there are constants $c_1, c_2 > 0$ such that for $c_1 \log n \le k \le c_2 n$ cluster $C^{(i)}$
will be roughly identified almost surely as $n \to \infty$. (Note here that the bounds
on $k$ for which our probability bounds hold are also logarithmic and linear,
respectively, in $n$.) Thus, the event $\mathcal{C}^{all}_n$ occurs almost surely and consequently
$N_{NoCluster}/N_{Cluster} \to 0$ almost surely. □



6 Discussion 



In this paper we studied the problem of cluster identification in kNN graphs. 
As opposed to earlier work (Brito et al. [5], Biau et al. [2j) which was only 
concerned with establishing connectivity results for a certain choice of k (resp. 
e in case of an e-neighborhood graph), our goal was to determine for which 
value of k the probability of cluster identification is maximized. Our work 
goes considerably beyond Brito et al. [5] and Biau et al. [2], concerning both 
the results and the proof techniques. In the noise-free case we come to the 
surprising conclusion that the optimal k is of the order of c · n rather than of the 
order of log n, as many people had suspected, both for mutual and symmetric 
kNN graphs. A similar result also holds for rough cluster identification in 
the noisy case. Both results were quite surprising to us: our first naive 
expectation based on the standard random geometric graph literature had 
been that k ~ log n would be optimal. In hindsight, our results make perfect 
sense. The minimal k to achieve within-cluster connectedness is indeed of the 
order log n. However, clusters can be more easily identified the more tightly they 
are connected. In an extreme case where clusters have a very large distance 
to each other, increasing k only increases the within-cluster connectedness. 
Only when the cluster is fully connected (that is, k coincides with the number 
of points in the cluster, so that k is a positive fraction of n) do connections to 
other clusters start to arise. Then the cluster will not be identified any more. 
Of course, the standard situation will not be as extreme as this one, but our 
proofs show that the tendency is the same. 
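
The effect of k described above can be illustrated on a toy example. The following is a minimal sketch, not the construction analyzed in the paper: the data (two groups of six points on the real line, far apart) and the helper names `knn_adjacency` and `count_components` are assumptions made purely for illustration, and the brute-force distance computation is only suitable for small samples.

```python
import numpy as np


def knn_adjacency(points, k, mutual):
    """Boolean adjacency matrix of the mutual or symmetric kNN graph."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]  # treat a flat array as points on the real line
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest points
    directed = np.zeros((n, n), dtype=bool)
    directed[np.repeat(np.arange(n), k), nearest.ravel()] = True
    # mutual: both points must list each other; symmetric: one direction suffices
    return directed & directed.T if mutual else directed | directed.T


def count_components(adj):
    """Number of connected components, found by depth-first search."""
    n = len(adj)
    seen = np.zeros(n, dtype=bool)
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1
        seen[start] = True
        stack = [start]
        while stack:
            v = stack.pop()
            for w in np.flatnonzero(adj[v] & ~seen):
                seen[w] = True
                stack.append(w)
    return count


# Hypothetical toy data: two clusters of six points each, with unequal gaps
# inside each cluster so that nearest neighbors are never tied.
offsets = np.array([0.0, 1.0, 2.1, 3.3, 4.6, 6.0])
points = np.concatenate([offsets, 100.0 + offsets])

results = {}
for k in (1, 5, 6):
    for mutual in (True, False):
        results[(k, mutual)] = count_components(knn_adjacency(points, k, mutual))
        name = "mutual" if mutual else "symmetric"
        print(f"k={k} {name}: {results[(k, mutual)]} component(s)")
```

On this toy data the two clusters are identified (exactly two components) for k = 5 in both graphs. For k = 1 the mutual graph splits each cluster into several pieces, and for k = 6, the number of points per cluster, both graphs merge the two clusters into a single component.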

While our results on the optimal choice of k are nice in theory, in practical 
applications they are often hard to realize. The larger the parameter k of the 
kNN graph is chosen, the less sparse the neighborhood graph becomes, and the 
more resources we need to compute the kNN graph and to run algorithms on 
it. This means that one has to make a trade-off: even if in many applications 
it is impossible to choose k of the order of c · n because of computational restrictions, 
one should attempt to choose k as large as one can afford, in order to obtain 
the most reliable clustering results. 

When comparing the symmetric and the mutual kNN graph, both graphs behave 
similarly in terms of the within-cluster connectedness. But note that this 
might be an artifact of our proof techniques, which are very similar in both 
cases and do not really make use of the different structure of the graphs. 
Concerning the between-cluster disconnectedness, however, the two graphs behave 
very differently. To ensure disconnectedness of one cluster $C^{(i)}$ from the other 
clusters in the mutual kNN graph, it is enough to make sure that the nearest 
neighbors of all points of $C^{(i)}$ are again elements of $C^{(i)}$. In this sense, the 
between-cluster disconnectedness of an individual cluster in the mutual graph 
can be expressed in terms of properties of this cluster only. In the symmetric 
kNN graph this is different. Here it can happen that some other cluster $C^{(j)}$ 
links inside $C^{(i)}$ no matter how nicely connected $C^{(i)}$ is. In particular, this 
affects the setting where the goal is to identify the most significant cluster only. 
While this is easy in the mutual kNN graph, in the symmetric kNN graph it is 
not easier than identifying all clusters, as the between-cluster disconnectedness 
is governed by the worst case. 
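
The asymmetry between the two graph types can be made concrete on a small example. The sketch below is an illustration under stated assumptions, not part of the paper's analysis: the data (one well-sampled cluster and a second cluster with only two points, all on the real line), the choice k = 3 and the helper name `knn_lists` are hypothetical.

```python
import numpy as np


def knn_lists(points, k):
    """Boolean matrix: entry (i, j) is True if j is among the k nearest neighbors of i."""
    pts = np.asarray(points, dtype=float)
    dist = np.abs(pts[:, None] - pts[None, :])  # points on the real line
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    directed = np.zeros(dist.shape, dtype=bool)
    directed[np.repeat(np.arange(len(pts)), k), nearest.ravel()] = True
    return directed


# Hypothetical toy data: a well-sampled cluster and a poorly sampled second cluster.
dense = np.array([0.0, 1.0, 2.1, 3.3, 4.6, 6.0])
sparse = np.array([20.0, 21.5])
points = np.concatenate([dense, sparse])
in_dense = np.arange(len(points)) < len(dense)

directed = knn_lists(points, k=3)
cross = np.outer(in_dense, ~in_dense)  # pairs (dense point, sparse point)

# Each undirected cross edge appears exactly once in the (dense, sparse) block.
symmetric_cross_edges = int(((directed | directed.T) & cross).sum())
mutual_cross_edges = int(((directed & directed.T) & cross).sum())

print("cross-cluster edges, symmetric graph:", symmetric_cross_edges)
print("cross-cluster edges, mutual graph:", mutual_cross_edges)
```

Here every point of the dense cluster has all of its 3 nearest neighbors inside its own cluster, so the mutual graph contains no edge leaving it. The two points of the sparse cluster, however, list points of the dense cluster among their neighbors, and the symmetric graph turns these one-sided choices into edges.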

From a technical point of view, there are some aspects of our work which 
could be improved. First, we believe that the geometry of the clusters does not 
influence our bounds in a satisfactory manner. The main geometric quantities 
which enter our bounds are simple things like the distance of the clusters 
to each other, the minimal and maximal density on the cluster, and so on. 
However, intuitively it seems plausible that cluster identification depends on 
other quantities as well, such as the shapes of the clusters and the relation of 
those shapes to each other. For example, we would expect cluster identification 
to be more difficult if the clusters have the form of concentric rings than if 
they are rings with different centers placed next to each other. Currently we 
cannot deal with such differences. Secondly, the covering techniques we use 
for proving our bounds are not well adapted to small sample sizes. We first 
cover all clusters completely by small balls, and then require that there is at 
least one sample point in each of those balls. This leads to the unpleasant side 
effect that our results are not valid for very small sample sizes n. However, we 
did not find a way to circumvent this construction. The reason is that as soon 
as one has to prove connectedness of a small sample of cluster points, one 
would have to explicitly construct a path connecting every pair of points. While 
some techniques from percolation theory might be used for this purpose in 
the two-dimensional setting, we did not see any way to solve this problem in 
high-dimensional spaces. 

In the current paper, we mainly worked with the cluster definition used in 
the statistics community, namely the connected components of t-level sets. In 
practice, most people try to avoid clustering by first applying density 
estimation, since density estimation is inherently difficult on small samples, 
in particular in high-dimensional spaces. On the other hand, we have already 
explained earlier that this inherent complexity of the problem also pays off. In 
the end, not only have we detected where the clusters are, but we also know 
where the data only consists of background noise. 

In the computer science community, clustering is often solved via partitioning 
algorithms such as mincuts or balanced cuts. Now that we have treated the case of 
the level sets in this paper, discussing the graph partitioning case will be the 
next logical step. Technically, this is a more advanced setting. The ingredients 
are no longer simple yes/no events (such as "cluster is connected" or "clusters 
are not connected to each other"). Instead, one has to carefully "count" how 
many edges one has in different areas of the graph. In future work we hope to 
prove results on the optimal choice of k for such a graph partitioning setting. 